<div style="background-color:#C0392B; text-align: center;color:#19180F; font-size:40px; font-family:Arial; padding:10px; border: 5px solid #19180F; border-radius:10px">
Build a Text Multi-Class Classification Model using BERT
 </div>

<div style="background-color:#D98880; color:#19180F; text-align: center;font-size:30px; font-family:Arial; padding:10px; border: 5px solid #19180F; border-radius:10px"> Introduction & Motivation </div>
<div style="background-color:#D5D9F2; color:#19180F; font-size:20px; font-family:verdana; padding:10px; border: 5px solid #19180F; border-radius:10px "> 
In this notebook we are going to present our DCGAN. Its purpose is to generate
fake images that look like real images, after training on a particular dataset. 
We were interested in GANs because we thought it would be really interesting to 
dive into the details of training one. For other types of deep learning 
architectures, it can be pretty straightforward to train a network, but that is 
not the case with GANs.
</div>



<div style="background-color:#D5D9F2; color:#19180F; font-size:20px; font-family:verdana; padding:10px; border: 5px solid #19180F; border-radius:10px "> 
Our training was executed on NVIDIA's latest-generation `A100` GPUs, because on CPU it took more than 5 hours, and that is for `128x128` images.
We will discuss scaling up our GAN later.  
</div>




<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Importing modules
    </div>

In [1]:
import re
import torch
import pickle
import numpy as np
import pandas as pd
from tqdm import tqdm
import torch.nn as nn
from transformers import BertModel
from transformers import BertTokenizer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split



In [44]:
lr = 1e-3
seq_len = 20
dropout = 0.5
num_epochs = 10

label_col = "Product"
tokens_path = "Output/tokens.pkl"
labels_path = "Output/labels.pkl"
data_path = "Input/complaints.csv"
model_path = "Output/bert_pre_trained.pth"
text_col_name = "Consumer complaint narrative"
label_encoder_path = "Output/label_encoder.pkl"
product_map = {'Vehicle loan or lease': 'vehicle_loan',
               'Credit reporting, credit repair services, or other personal consumer reports': 'credit_report',
               'Credit card or prepaid card': 'card',
               'Money transfer, virtual currency, or money service': 'money_transfer',
               'virtual currency': 'money_transfer',
               'Mortgage': 'mortgage',
               'Payday loan, title loan, or personal loan': 'loan',
               'Debt collection': 'debt_collection',
               'Checking or savings account': 'savings_account',
               'Credit card': 'card',
               'Bank account or service': 'savings_account',
               'Credit reporting': 'credit_report',
               'Prepaid card': 'card',
               'Payday loan': 'loan',
               'Other financial service': 'others',
               'Virtual currency': 'money_transfer',
               'Student loan': 'loan',
               'Consumer Loan': 'loan',
               'Money transfers': 'money_transfer'}

In [3]:
def save_file(name, obj):
    """
    Function to save an object as pickle file
    """
    with open(name, 'wb') as f:
        pickle.dump(obj, f)


def load_file(name):
    """
    Function to load a pickle object
    """
    # Use a context manager so the file handle is always closed
    with open(name, "rb") as f:
        return pickle.load(f)

<div style="background-color:#F1C40F; color:#C0392B; font-size:40px; font-family:Arial; padding:10px; border: 5px solid #19180F; border-radius:10px"> I. Text Data Processing</div>

In [11]:
data = pd.read_csv(data_path, delimiter=',', quotechar='"', engine='python', on_bad_lines='skip')

In [12]:
data.dropna(subset=[text_col_name], inplace=True)

In [13]:
data.replace({label_col: product_map}, inplace=True)
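The nested-dictionary form of `DataFrame.replace` used above maps old values to new ones within a single column. A minimal sketch on a toy frame follows; `toy` and `toy_map` are illustrative names and only a small subset of `product_map` is reproduced.

```python
import pandas as pd

# Toy frame; the column name mirrors label_col in this notebook.
toy = pd.DataFrame(
    {"Product": ["Credit card", "Prepaid card", "Mortgage", "Payday loan"]}
)

# Small subset of the mapping, for illustration only.
toy_map = {
    "Credit card": "card",
    "Prepaid card": "card",
    "Mortgage": "mortgage",
    "Payday loan": "loan",
}

# Nested dict form: {column_name: {old_value: new_value}}.
merged = toy.replace({"Product": toy_map})

print(merged["Product"].tolist())   # ['card', 'card', 'mortgage', 'loan']
print(merged["Product"].nunique())  # 3
```

Values that do not appear in the mapping are left unchanged, so it can be worth inspecting `value_counts()` after the replacement to spot categories that were missed.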

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Encode labels
    </div>

In [15]:
label_encoder = LabelEncoder()
label_encoder.fit(data[label_col])
labels = label_encoder.transform(data[label_col])

In [16]:
save_file(labels_path, labels)
save_file(label_encoder_path, label_encoder)
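As a quick illustration of what `LabelEncoder` produces, the sketch below encodes a toy list of labels and maps predictions back to names. The toy labels are assumptions chosen for the example, not values read from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["mortgage", "card", "loan", "card"]

enc = LabelEncoder()
ids = enc.fit_transform(toy_labels)

# Classes are stored in sorted order, so the integer ids follow that order.
print(enc.classes_.tolist())                    # ['card', 'loan', 'mortgage']
print(ids.tolist())                             # [2, 0, 1, 0]

# inverse_transform turns predicted ids back into readable labels.
print(enc.inverse_transform([0, 2]).tolist())   # ['card', 'mortgage']
```

Saving the fitted encoder, as done above, is what makes this inverse mapping available at inference time.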


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Process the text column
    </div>

In [32]:
input_text = list(data[text_col_name])

In [33]:
len(input_text)

241440

In [41]:
## convert text to lower case
input_text = [i.lower() for i in tqdm(input_text)]

## remove dollar amounts written as {$100.00}, then any remaining digits
input_text = [re.sub(r'\{\$\d+\.\d{2}\}', "", i) for i in tqdm(input_text)]
input_text = [re.sub(r"\d+", "", i) for i in tqdm(input_text)]

## remove runs of two or more consecutive x characters (e.g. xxxx)
input_text = [re.sub(r'x{2,}', "", i) for i in tqdm(input_text)]

## replace multiple spaces with a single space
input_text = [re.sub(r' +', ' ', i) for i in tqdm(input_text)]

## remove forward slashes
input_text = [re.sub(r'/', '', i) for i in tqdm(input_text)]

  0%|          | 0/241440 [00:00<?, ?it/s]

100%|██████████| 241440/241440 [00:00<00:00, 895226.24it/s]
100%|██████████| 241440/241440 [00:00<00:00, 793735.19it/s]
100%|██████████| 241440/241440 [00:03<00:00, 73074.98it/s]
100%|██████████| 241440/241440 [00:02<00:00, 96332.19it/s] 
100%|██████████| 241440/241440 [00:09<00:00, 25077.84it/s]
100%|██████████| 241440/241440 [00:00<00:00, 865393.96it/s]
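To see what the cleaning steps do to a single string, the sketch below chains the same substitutions in the same order inside a helper. The function name `clean` and the sample sentence are made up for illustration.

```python
import re


def clean(text):
    """Apply the notebook's cleaning steps, in order, to one string."""
    text = text.lower()
    text = re.sub(r"\{\$\d+\.\d{2}\}", "", text)  # amounts such as {$150.00}
    text = re.sub(r"\d+", "", text)               # remaining digits
    text = re.sub(r"x{2,}", "", text)             # runs of x such as xxxx
    text = re.sub(r" +", " ", text)               # collapse repeated spaces
    text = re.sub(r"/", "", text)                 # forward slashes
    return text


sample = "I paid {$150.00} to XXXX on 12/03/2021."
print(clean(sample))  # i paid to on .
```

Because slashes are removed after spaces are collapsed, a date such as `12/03/2021` disappears entirely and leaves a stray space before the final period.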


In [39]:
## Tokenize text 
tokenizer=BertTokenizer.from_pretrained('bert-base-cased')

In [40]:
input_text[0]

'i contacted ally on friday after falling behind on payments due to being out of work for a short period of time due to an illness. i chated with a representative after logging into my account regarding my opitions to ensure i protect my credit and bring my account current. \n\nshe advised me that before an extenstion could be done, i had to make a payment in the amount of . i reviewed my finances, as i am playing catch up on all my bills and made this payment on monday . this rep advised me, once this payment posts to my account to contact ally back for an extention or to have a payment deffered to the end of my loan. \n\nwith this in mind, i contacted ally again today and chatted with . i explained all of the above and the information i was provided when i chatted with the rep last week. she asked several questions and advised me that a one or two month extensiondeffered payment could be done however partial payment is needed! what? she advised me or there abouts would be due within 

In [52]:
## a tokenization example
sample_tokens = tokenizer(input_text[0], padding=True,
                          max_length = seq_len, truncation = True,
                          return_tensors='pt')

In [49]:
sample_tokens

{'input_ids': tensor([[  101,   178, 12017, 11989,  1113,   175, 22977,  1183,  1170,  4058,
          1481,  1113, 10772,  1496,  1106,  1217,  1149,  1104,  1250,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [50]:
sample_tokens["input_ids"]

tensor([[  101,   178, 12017, 11989,  1113,   175, 22977,  1183,  1170,  4058,
          1481,  1113, 10772,  1496,  1106,  1217,  1149,  1104,  1250,   102]])

In [51]:
sample_tokens["attention_mask"]

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

In [53]:
## tokenization of all complaints in the data 
tokens = [tokenizer(i, padding="max_length", max_length=seq_len, 
                    truncation=True, return_tensors="pt") 
         for i in tqdm(input_text)]

100%|██████████| 241440/241440 [09:07<00:00, 441.09it/s]



<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Save tokens
    </div>


<div style="background-color:#D5D9F2; color:#19180F; font-size:20px; font-family:verdana; padding:10px; border: 5px solid #19180F; border-radius:10px "> 
Now that our tokens are prepared as input for the model, we save them to disk. This is useful if we need them for another task with the same model (BERT) in the future, and anyone who wants to reproduce this project can load the tokens directly, provided they use the same model. 
</div>




In [54]:
save_file(tokens_path, tokens)
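A minimal sketch of the save-and-reload round trip is shown below. It redefines the two pickle helpers so it runs on its own, writes to a temporary directory rather than `Output/`, and uses plain dictionaries as stand-ins for the tokenizer output.

```python
import os
import pickle
import tempfile


def save_file(name, obj):
    """Save an object as a pickle file."""
    with open(name, "wb") as f:
        pickle.dump(obj, f)


def load_file(name):
    """Load a pickled object."""
    with open(name, "rb") as f:
        return pickle.load(f)


# Stand-in for the list of tokenized complaints (illustrative values).
toy_tokens = [{"input_ids": [101, 178, 102], "attention_mask": [1, 1, 1]}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokens.pkl")
    save_file(path, toy_tokens)
    restored = load_file(path)

print(restored == toy_tokens)  # True
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.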

<div style="background-color:#F1C40F; color:#C0392B; font-size:40px; font-family:Arial; padding:10px; border: 5px solid #19180F; border-radius:10px"> II. Create BERT Model</div>


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
BertClassifier class 
    </div>

In [56]:
class BertClassifier(nn.Module):

    def __init__(self, dropout, num_classes):
        super(BertClassifier, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-cased')
        # Freeze BERT so that only the classification head is trained
        for param in self.bert.parameters():
            param.requires_grad = False
        self.dropout = nn.Dropout(dropout)
        # Map the pooled BERT output (hidden size 768 for bert-base) to class scores
        self.linear = nn.Linear(self.bert.config.hidden_size, num_classes)
        self.activation = nn.ReLU()

    def forward(self, input_ids, attention_mask):
        # With return_dict=False, BERT returns (sequence_output, pooled_output)
        _, bert_output = self.bert(input_ids=input_ids,
                                   attention_mask=attention_mask,
                                   return_dict=False)
        dropout_output = self.activation(self.dropout(bert_output))
        final_output = self.linear(dropout_output)
        return final_output
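Freezing the encoder is easy to get wrong, because Python silently accepts a misspelled attribute name and the weights then stay trainable. The sketch below shows one way to check which parameters will be updated. To avoid downloading BERT it uses a small `nn.Linear` as a stand-in encoder; `TinyClassifier` and its sizes are hypothetical.

```python
import torch.nn as nn


class TinyClassifier(nn.Module):
    """Frozen stand-in encoder followed by a trainable linear head."""

    def __init__(self, hidden_size, num_classes):
        super().__init__()
        self.encoder = nn.Linear(hidden_size, hidden_size)  # stand-in for BERT
        for param in self.encoder.parameters():
            param.requires_grad = False                      # freeze the encoder
        self.linear = nn.Linear(hidden_size, num_classes)


model = TinyClassifier(hidden_size=8, num_classes=3)

# Only parameters with requires_grad=True receive gradient updates.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(trainable)    # ['linear.weight', 'linear.bias']
print(n_trainable)  # 27
```

The same check applied to the real classifier should list only the parameters of the final linear layer.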


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
PyTorch Dataset
    </div>

In [57]:
class TextDataset(torch.utils.data.Dataset):
    
    def __init__(self, tokens, labels):
        self.tokens = tokens
        self.labels = labels
        
    def __len__(self):
        return len(self.tokens)
    
    def __getitem__(self, idx):
        return self.labels[idx], self.tokens[idx]
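Each tokenized complaint holds tensors of shape `(1, seq_len)`, so the default `DataLoader` collation stacks them into `(batch, 1, seq_len)`. The sketch below reproduces this with dummy tensors. `ToyDataset` mirrors the dataset class above, and the token values are placeholders, not real tokenizer output.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Same (label, tokens) layout as the dataset defined above."""

    def __init__(self, tokens, labels):
        self.tokens = tokens
        self.labels = labels

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, idx):
        return self.labels[idx], self.tokens[idx]


seq_len = 20
# Placeholder token dictionaries shaped like return_tensors="pt" output.
tokens = [
    {"input_ids": torch.zeros(1, seq_len, dtype=torch.long),
     "attention_mask": torch.ones(1, seq_len, dtype=torch.long)}
    for _ in range(4)
]
labels = [0, 1, 2, 1]

loader = DataLoader(ToyDataset(tokens, labels), batch_size=4)
batch_labels, batch_data = next(iter(loader))

print(batch_labels.shape)                                # torch.Size([4])
print(batch_data["input_ids"].shape)                     # torch.Size([4, 1, 20])
print(torch.squeeze(batch_data["input_ids"], 1).shape)   # torch.Size([4, 20])
```

Squeezing dimension 1 explicitly, rather than calling `squeeze()` with no argument, keeps the batch dimension intact even when a batch contains a single example.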


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Function to train the model
    </div>

In [58]:
def train(train_loader, valid_loader, model, criterion, optimizer,
          device, num_epochs, model_path):
    """
    Function to train the model
    :param train_loader: Data loader for train dataset
    :param valid_loader: Data loader for validation dataset
    :param model: Model object
    :param criterion: Loss function
    :param optimizer: Optimizer
    :param device: CUDA or CPU
    :param num_epochs: Number of epochs
    :param model_path: Path to save the model
    """
    # Initialize the best validation loss with a large value
    best_loss = 1e8
    for i in range(num_epochs):
        print(f"Epoch {i+1} of {num_epochs}")
        # Per-batch losses collected over the current epoch
        valid_loss, train_loss = [], []
        model.train()
        # Training loop
        for batch_labels, batch_data in tqdm(train_loader):
            input_ids = batch_data["input_ids"]
            attention_mask = batch_data["attention_mask"]
            # Move data to GPU if available
            batch_labels = batch_labels.to(device)
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            # The loader returns 3D tensors (batch, 1, seq_len); the model needs 2D
            input_ids = torch.squeeze(input_ids, 1)
            attention_mask = torch.squeeze(attention_mask, 1)
            # Forward pass: output has shape (batch, num_classes)
            batch_output = model(input_ids, attention_mask)
            # Calculate loss
            loss = criterion(batch_output, batch_labels)
            # Store a Python float so np.mean works and the graph is not kept
            train_loss.append(loss.item())
            # Reset the gradients accumulated from the previous batch
            optimizer.zero_grad()
            # Backward pass
            loss.backward()
            # Gradient update step
            optimizer.step()
        # Validation loop: evaluation mode, no gradient tracking
        model.eval()
        with torch.no_grad():
            for batch_labels, batch_data in tqdm(valid_loader):
                input_ids = batch_data["input_ids"]
                attention_mask = batch_data["attention_mask"]
                # Move data to GPU if available
                batch_labels = batch_labels.to(device)
                input_ids = input_ids.to(device)
                attention_mask = attention_mask.to(device)
                input_ids = torch.squeeze(input_ids, 1)
                attention_mask = torch.squeeze(attention_mask, 1)
                # Forward pass
                batch_output = model(input_ids, attention_mask)
                # Calculate loss
                loss = criterion(batch_output, batch_labels)
                valid_loss.append(loss.item())
        # Mean train and validation loss over the epoch
        t_loss = np.mean(train_loss)
        v_loss = np.mean(valid_loss)
        print(f"Train Loss: {t_loss}, Validation Loss: {v_loss}")
        # Keep the model with the lowest validation loss seen so far
        if v_loss < best_loss:
            best_loss = v_loss
            # Save the current model as the best model
            torch.save(model.state_dict(), model_path)
        print(f"Best Validation Loss: {best_loss}")
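The sketch below shows, on toy data, the kind of pieces such a training function expects: a train/validation split, a cross-entropy criterion that takes logits of shape `(batch, num_classes)` with integer labels, and an optimizer. It is only an illustration under stated assumptions: the random features stand in for pooled BERT outputs, a single `nn.Linear` stands in for the classifier, and the split ratio and seed are arbitrary choices, not the notebook's settings.

```python
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split

torch.manual_seed(0)

# Toy stand-ins: 10 examples, 8 features each, 3 classes.
features = torch.randn(10, 8)
labels = torch.tensor([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])

# Split example indices into train and validation sets (80/20).
train_idx, valid_idx = train_test_split(
    list(range(len(labels))), test_size=0.2, random_state=0
)

model = nn.Linear(8, 3)              # stand-in for the classifier
criterion = nn.CrossEntropyLoss()    # expects raw logits and integer labels
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step.
model.train()
logits = model(features[train_idx])
loss = criterion(logits, labels[train_idx])
optimizer.zero_grad()
loss.backward()
optimizer.step()

# One validation pass without gradient tracking.
model.eval()
with torch.no_grad():
    valid_loss = criterion(model(features[valid_idx]), labels[valid_idx]).item()

print(len(train_idx), len(valid_idx))  # 8 2
print(tuple(logits.shape))             # (8, 3)
```

Because `CrossEntropyLoss` applies the softmax internally, the model should return raw scores rather than probabilities.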