#### Notebook
Text Classification using BERT(HuggingFace Transformers) in PyTorch. In this, We shall leverage a pre-trained BERT model from HuggingFace

Dataset Link - https://www.kaggle.com/datasets/shivamkushwaha/bbc-full-text-document-classification

In [1]:
#neccessary imports
import numpy as np
import pandas as pd
import glob 
import io
import os
import warnings
warnings.filterwarnings("ignore")
!pip install transformers --quiet
from transformers import BertTokenizer
from transformers import BertModel
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# os.chdir('/content/drive/MyDrive/Datasets')
# !unzip archive.zip

In [2]:
dataset_path = "/content/drive/MyDrive/Datasets/bbc-fulltext (document classification)/bbc"

In [3]:
dir_path = []
for dirname,_, filenames in os.walk(dataset_path):
    dir_path.append(dirname)

print(dir_path)

['/content/drive/MyDrive/Datasets/bbc-fulltext (document classification)/bbc', '/content/drive/MyDrive/Datasets/bbc-fulltext (document classification)/bbc/business', '/content/drive/MyDrive/Datasets/bbc-fulltext (document classification)/bbc/entertainment', '/content/drive/MyDrive/Datasets/bbc-fulltext (document classification)/bbc/politics', '/content/drive/MyDrive/Datasets/bbc-fulltext (document classification)/bbc/sport', '/content/drive/MyDrive/Datasets/bbc-fulltext (document classification)/bbc/tech']


In [4]:
#converting text to dataframe
def text_to_df(dataset_path):
    #creating a dataframe
    df = pd.DataFrame(columns=['NEWS','CATEGORY'])
    text,label = [],[]
    for dir_path in dataset_path:
        text_files_path = sorted(glob.glob(os.path.join(dir_path,"*.txt")))
        for text_path in text_files_path:
            with io.open(text_path,'r',encoding='utf-8', errors='ignore') as txt_file:
                text.append(txt_file.read())
                label.append(dir_path.split('/')[-1])

    df['NEWS']=text
    df['CATEGORY']=label

    return df

In [5]:
df = text_to_df(dir_path[1:])

In [6]:
df

Unnamed: 0,NEWS,CATEGORY
0,Ad sales boost Time Warner profit\n\nQuarterly...,business
1,Dollar gains on Greenspan speech\n\nThe dollar...,business
2,Yukos unit buyer faces loan claim\n\nThe owner...,business
3,High fuel prices hit BA's profits\n\nBritish A...,business
4,Pernod takeover talk lifts Domecq\n\nShares in...,business
...,...,...
2220,BT program to beat dialler scams\n\nBT is intr...,tech
2221,Spam e-mails tempt net shoppers\n\nComputer us...,tech
2222,Be careful how you code\n\nA new European dire...,tech
2223,US cyber security chief resigns\n\nThe man mak...,tech


In [7]:
news = df['NEWS'].values.tolist()
category = df['CATEGORY'].values.tolist()

#### About BERT
1. BERT represents Bidirectional Encoder Representations from Transformers.
2. BERT architecture consists of several Transformer encoders stacked together where each transformer encoder is comprised of two sub layers - self attention and a feed forward layer.
3. There are two flavours of BERT models - BERT base and BERT large.
4. BERT base consists of 12 transformer encoders, 12 attention head and 768 hidden size.
5. BERT large conssits fo 24 transformer encoders, 16 attention heads and 1024 hidden size.
6. BERT's powerful language model since it's trained on unlabeled data extracted from BooksCorpus and from Wikipedia.
7. BERT learns from sequence of words from Left to right and right to left.
8. BERT model's input is a sequence of tokens where each token has
 > [CLS] token which is the beginning of each token. It mean Classification token.
 > [SEP] token is for next sentence prediction or Q-A answering task.
 > Maximum size of tokens that can be fed into BERT is 512, If the tokens in a sequence are less than 512, We use [PAD] tokens.
9. BERT model's output is an embedding vector of size 768 also know as pooled output. It's mainly used for fine tuning tasks like Text classification, NSP, NER or Q&A. The embedding vector of size 768 from [CLS] token is fed to any classifier for the aforementioned task.

In [8]:
#analysing length of token
suitable_doc = 0
for i in range(1,len(news),10):
    #print("Number of tokens in {}th sample is {} ".format(i,len(news[i].split(' '))))  #max size for BERT input is 512
    if len(news[i].split(' '))<512:
        suitable_doc+=1
    else:
        print("Number of tokens in {}th sample is {} ".format(i,len(news[i].split(' ')))) 

print("Percentage of documents that are suitable for BERT input are {} %".format(suitable_doc/len(news)*100))

Number of tokens in 241th sample is 604 
Number of tokens in 251th sample is 599 
Number of tokens in 261th sample is 539 
Number of tokens in 331th sample is 537 
Number of tokens in 531th sample is 592 
Number of tokens in 631th sample is 736 
Number of tokens in 651th sample is 554 
Number of tokens in 771th sample is 829 
Number of tokens in 881th sample is 627 
Number of tokens in 961th sample is 526 
Number of tokens in 1141th sample is 564 
Number of tokens in 1151th sample is 577 
Number of tokens in 1231th sample is 805 
Number of tokens in 1291th sample is 687 
Number of tokens in 1301th sample is 976 
Number of tokens in 1371th sample is 675 
Number of tokens in 1381th sample is 793 
Number of tokens in 1491th sample is 644 
Number of tokens in 1931th sample is 606 
Number of tokens in 1961th sample is 619 
Number of tokens in 1991th sample is 571 
Number of tokens in 2011th sample is 1324 
Number of tokens in 2021th sample is 915 
Number of tokens in 2031th sample is 612 
N

In all the samples where input_size is greater than 512, We shall truncate the input and pass further. However, It is not recommended.

List of various tokenizers can be found from https://huggingface.co/models

In [9]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [10]:
tokenised_text = tokenizer(news[2091],padding='max_length', max_length=512, truncation=True, return_tensors="pt")


The tokenised text is a dictionary with following keys which are elaborated.
1. input_ids - It is the id representation of each token.
2. token_type_ids - It is a binary mask that identifies in which sequence a token belongs. It's an optional input for various tasks.
3. attention_mask - It is a binary mask which identifies whether a token is a real word or padding. If the token contains [CLS],[SEP] or real word then mask is 1 else for [PAD] its 9


In [11]:
tokenised_text.keys()


dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [12]:
tokenised_text['input_ids']

tensor([[  101,  9980,  2489,  2015,  3156,  4007, 13979,  3274,  5016,  9980,
          2758,  3156,  1997,  2049,  4007, 13979,  2097,  2022,  2207,  2046,
          1996,  2330,  2458,  2451,  1012,  1996,  2693,  2965,  9797,  2097,
          2022,  2583,  2000,  2224,  1996,  6786,  2302,  7079,  2005,  1037,
         11172,  2013,  1996,  2194,  1012,  9980,  2649,  1996,  3357,  2004,
          1037,  1000,  2047,  3690,  1000,  1999,  2129,  2009,  9411,  2007,
          7789,  3200,  1998,  5763,  2582, 13979,  2052,  2022,  2081, 10350,
          2800,  1012,  1996, 13979,  2421,  4007,  2005,  1037,  2846,  1997,
          6078,  1010,  2164,  3793,  5038,  1998,  7809,  2968,  1012,  3151,
          2974,  2449,  3343,  2003,  2000, 25933,  4757, 13979,  1998,  2750,
          9980,  1005,  1055,  8874,  1996,  2194,  4247,  2000,  3582,  2023,
          2799,  1012,  9980,  2001,  4379,  1017,  1010, 24568, 13979,  1999,
          2432,  1010,  2062,  2084,  2151,  2060,  

In [13]:
#getting back text from input_ids
example_text = tokenizer.decode(tokenised_text.input_ids[0])
print(example_text)

[CLS] ibm frees 500 software patents computer giant ibm says 500 of its software patents will be released into the open development community. the move means developers will be able to use the technologies without paying for a licence from the company. ibm described the step as a " new era " in how it dealt with intellectual property and promised further patents would be made freely available. the patents include software for a range of practices, including text recognition and database management. traditional technology business policy is to amass patents and despite ibm's announcement the company continues to follow this route. ibm was granted 3, 248 patents in 2004, more than any other firm in the us, the new york times reports. for each of the past 12 years ibm has been granted more us patents than any other company. ibm has received 25, 772 us patents in that period and reportedly has more than 40, 000 current patents. in a statement, dr john e. kelly, ibm senior vice president, t

In [14]:
#building the dataset
label_mapper = {'business':0, 'entertainment':1, 'sport':2, 'tech':3, 'politics':4}
inverse_label_mapper = {} #for prediction
for k,v in label_mapper.items():
    inverse_label_mapper[v] = k

print(inverse_label_mapper)


{0: 'business', 1: 'entertainment', 2: 'sport', 3: 'tech', 4: 'politics'}


In [15]:
class TextClassificationDataset(Dataset):
    def __init__(self,news,labels):
        self.tokenizer= BertTokenizer.from_pretrained('bert-base-cased')
        self.tokenized_news = []
        for doc_text in news:
            tokenised_doc = self.tokenizer(doc_text,padding='max_length', max_length=512, truncation=True, return_tensors="pt")
            self.tokenized_news.append(tokenised_doc)
        
        self.encoded_labels = []
        for label in labels:
            self.encoded_labels.append(label_mapper[label])

    def __len__(self):
        return len(self.tokenized_news)

    def __getitem__(self, idx):
        tokenised_news = self.tokenized_news[idx]
        encoded_label = self.encoded_labels[idx]
        
        item = {'input_tokens':tokenised_news,'output_labels':np.array(encoded_label)}
        return item

In [16]:
#splitting into train ,val
X_train, X_test, y_train, y_test = train_test_split(news, category, test_size=0.10, random_state=42)

In [17]:
train_dataset = TextClassificationDataset(X_train,y_train)
val_dataset = TextClassificationDataset(X_test,y_test)

In [18]:
train_dataset.__getitem__(5)

{'input_tokens': {'input_ids': tensor([[  101, 22163,  2705,  2274, 10217,  4829,  1109,  2705,  3275,  1104,
           1103, 22163, 27482,  1144,  1678,  1285,   118,  1106,   118,  1285,
           1654,  1104,  1157,  7851,  1610,  1671,  1107,  1126,  3098,  1106,
           1885,  1122,  1213,   119, 15720,  1345,  1988,  1673,  1144,  2125,
           6717,  3177, 10212,  1112,  2705,  3275,  1104, 22163, 12983,   117,
           1114,  1828,  3177, 10212,  2128,  1103,  1419,   119,  1828,  1345,
           1988,  1673,  3316,  1103,  2223,  1246,  1104,  1103,  1671,   118,
           1134,  1110,  2637,  1106,  1294,   170,  4645,  1306, 27772,   113,
            109,   122,  1830,  1179,   114,  2445,  1107,  1516,   118,  1107,
           1112,  1242,  1201,   119, 22163,  1223,  3365, 17747,  1103,  2319,
           1107,  1980,  1314,  1214,   117,  3195,  3596,  3813,   119,  1109,
           1610,  1671,  1144,  1189,  1126,  3389,  2445,  1107,  1421,  1104,
          

In [19]:
train_loader = DataLoader(train_dataset,shuffle=True,pin_memory=True,batch_size=64)
val_loader = DataLoader(val_dataset,shuffle=True,pin_memory=True,batch_size=64)


In [20]:
for batch in train_loader:
    print(batch)
    break

{'input_tokens': {'input_ids': tensor([[[  101, 12646,  8755,  ...,  2846,  1112,   102]],

        [[  101, 11644,  6439,  ...,     0,     0,     0]],

        [[  101,  7268,  2642,  ...,  2616,  2182,   102]],

        ...,

        [[  101,   148,  5114,  ...,     0,     0,     0]],

        [[  101,  1993, 20335,  ...,     0,     0,     0]],

        [[  101,  6940,  3606,  ...,     0,     0,     0]]]), 'token_type_ids': tensor([[[0, 0, 0,  ..., 0, 0, 0]],

        [[0, 0, 0,  ..., 0, 0, 0]],

        [[0, 0, 0,  ..., 0, 0, 0]],

        ...,

        [[0, 0, 0,  ..., 0, 0, 0]],

        [[0, 0, 0,  ..., 0, 0, 0]],

        [[0, 0, 0,  ..., 0, 0, 0]]]), 'attention_mask': tensor([[[1, 1, 1,  ..., 1, 1, 1]],

        [[1, 1, 1,  ..., 0, 0, 0]],

        [[1, 1, 1,  ..., 1, 1, 1]],

        ...,

        [[1, 1, 1,  ..., 0, 0, 0]],

        [[1, 1, 1,  ..., 0, 0, 0]],

        [[1, 1, 1,  ..., 0, 0, 0]]])}, 'output_labels': tensor([3, 0, 0, 2, 1, 0, 3, 2, 2, 3, 1, 3, 0, 1, 4, 4, 2, 1

In [21]:
#creating model
class BertFineTuner(nn.Module):
    def __init__(self):
        super(BertFineTuner,self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-cased')
        #freezing model
        for p in self.bert.parameters():
                p.requires_grad = False
        self.dropout = nn.Dropout(0.2)
        self.linear = nn.Linear(768,5)
        self.relu = nn.ReLU()

    def forward(self,input_id,attention_mask):
        # 768 dim pooled output
        _,pooled_output = self.bert(input_ids = input_id,attention_mask=attention_mask,return_dict=False)
        out = self.dropout(pooled_output)
        out = self.linear(out)
        out = self.relu(out)

        return out
        

In [24]:

# Training model
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BertFineTuner().to(device)
optimizer = optim.Adam(model.parameters(),lr=0.001)
criterion = nn.CrossEntropyLoss(ignore_index=0)
num_epochs =20

for epoch in range(num_epochs):
    model.train()
    for i,batch in enumerate(train_loader):
        input_ids= batch['input_tokens']['input_ids'].squeeze(1).to(device)
        attention_masks = batch['input_tokens']['attention_mask'].to(device)
        output_label = batch['output_labels'].to(device)
        output = model(input_ids,attention_masks)
        train_loss = criterion(output,output_label.long())
        train_acc = (output.argmax(dim=1) == output_label).sum().item()

        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()
    

    #validation
    with torch.no_grad():
        model.eval()
        for i,batch in enumerate(val_loader):
            input_ids= batch['input_tokens']['input_ids'].squeeze(1).to(device)
            attention_masks = batch['input_tokens']['attention_mask'].to(device)
            output_label = batch['output_labels'].to(device)
            output = model(input_ids,attention_masks)
            val_loss = criterion(output,output_label.long())
            val_acc = (output.argmax(dim=1) == output_label).sum().item()
    print("Epoch - {}, Train Loss - {}, Train Accuracy -{}, Val Loss - {}, Val Accuracy - {}".format(epoch,train_loss.item(),train_acc,val_loss.item(),val_acc))

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Epoch - 0, Train Loss - 1.6241945028305054, Train Accuracy -3, Val Loss - 1.6002681255340576, Val Accuracy - 5
Epoch - 1, Train Loss - 1.3529332876205444, Train Accuracy -6, Val Loss - 1.8286601305007935, Val Accuracy - 2
Epoch - 2, Train Loss - 1.536146879196167, Train Accuracy -4, Val Loss - 1.4122750759124756, Val Accuracy - 9
Epoch - 3, Train Loss - 1.682178258895874, Train Accuracy -4, Val Loss - 1.4606709480285645, Val Accuracy - 9
Epoch - 4, Train Loss - 1.2396433353424072, Train Accuracy -9, Val Loss - 1.3118032217025757, Val Accuracy - 11


In [40]:
def predict(text):
  tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
  # tokenize text and generate input_ids and attn_mask
  token = tokenizer(text,padding='max_length', max_length=512, truncation=True, return_tensors="pt")
  token['input_ids'] = token['input_ids'].squeeze(1).to(device)
  token['attention_mask'] = token['attention_mask'].to(device)
  with torch.no_grad():
    logits = model(token['input_ids'],token['attention_mask'])
  #applying softmax
  probs = torch.nn.functional.softmax(logits,dim=1).cpu().numpy()
  category = np.argmax(probs,axis=1)[0]
  return inverse_label_mapper[category]



In [42]:
predict("The plain green Norway spruce is displayed in the gallery's foyer. Its light bulb adornments are dimmed, ordinary domestic ones joined together with string. The plates decorating the branches will be auctioned off for the children's charity ArtWorks. Wentworth worked as an assistant to sculptor Henry Moore in the late 1960s. His reputation as a sculptor grew in the 1980s, while he has been one of the most influential teachers during the last two decades. Wentworth is also known for his photography of mundane, everyday subjects such as a cigarette packet jammed under the wonky leg of a table")

'sport'