<div style="background-color:#C0392B; text-align: center;color:#19180F; font-size:40px; font-family:Arial; padding:10px; border: 5px solid #19180F; border-radius:10px">
Build a Text Multi-Class Classification Model using BERT
 </div>

<div style="background-color:#D98880; color:#19180F; text-align: center;font-size:30px; font-family:Arial; padding:10px; border: 5px solid #19180F; border-radius:10px"> Introduction & Motivation </div>
<div style="background-color:#D5D9F2; color:#19180F; font-size:20px; font-family:verdana; padding:10px; border: 5px solid #19180F; border-radius:10px "> 
In this notebook we are going to present our DCGAN. Its purpose is to generate
fake images that look like real images, after training on a particular dataset. 
We were interested in GANs because we  thought it would be really interesting to 
dive into the details of training one. For other types of deep learning 
architectures, it can be pretty straightforward to train a network, but that is 
not the case with GANs.
</div>



<div style="background-color:#D5D9F2; color:#19180F; font-size:20px; font-family:verdana; padding:10px; border: 5px solid #19180F; border-radius:10px "> 
Our training was executed on NVIDIA's last generation GPUs, `A100`. because on CPU this took more then 5 hours, And this is for `128x128` images.
We will discuss about scaling up our GAN later.  
</div>




<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Importing modules
    </div>

In [4]:
import re
import torch
import pickle
import numpy as np
import pandas as pd
from tqdm import tqdm
import torch.nn as nn
from transformers import BertModel
from transformers import BertTokenizer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

  from .autonotebook import tqdm as notebook_tqdm


In [5]:
lr = 1e-3
seq_len = 20
dropout = 0.5
num_epochs = 10

label_col = "Product"
tokens_path = "Output/tokens.pkl"
labels_path = "Output/labels.pkl"
data_path = "Input/complaints.csv"
model_path = "Output/bert_pre_trained.pth"
text_col_name = "Consumer complaint narrative"
label_encoder_path = "Output/label_encoder.pkl"
product_map = {'Vehicle loan or lease': 'vehicle_loan',
               'Credit reporting, credit repair services, or other personal consumer reports': 'credit_report',
               'Credit card or prepaid card': 'card',
               'Money transfer, virtual currency, or money service': 'money_transfer',
               'virtual currency': 'money_transfer',
               'Mortgage': 'mortgage',
               'Payday loan, title loan, or personal loan': 'loan',
               'Debt collection': 'debt_collection',
               'Checking or savings account': 'savings_account',
               'Credit card': 'card',
               'Bank account or service': 'savings_account',
               'Credit reporting': 'credit_report',
               'Prepaid card': 'card',
               'Payday loan': 'loan',
               'Other financial service': 'others',
               'Virtual currency': 'money_transfer',
               'Student loan': 'loan',
               'Consumer Loan': 'loan',
               'Money transfers': 'money_transfer'}

In [6]:
def save_file(name, obj):
    """
    Function to save an object as pickle file
    """
    with open(name, 'wb') as f:
        pickle.dump(obj, f)


def load_file(name):
    """
    Function to load a pickle object
    """
    return pickle.load(open(name, "rb"))

<div style="background-color:#F1C40F; color:#C0392B; font-size:40px; font-family:Arial; padding:10px; border: 5px solid #19180F; border-radius:10px"> I. Text Data Processing</div>

In [7]:
data=pd.read_csv(data_path, delimiter=',', quotechar='"',  engine='python',on_bad_lines='skip')

In [8]:
data.dropna(subset=[text_col_name], inplace=True)

In [9]:
data.replace({label_col: product_map}, inplace=True)

<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Encode labels
    </div>

In [10]:
label_encoder = LabelEncoder()
label_encoder.fit(data[label_col])
labels = label_encoder.transform(data[label_col])

In [11]:
save_file(labels_path, labels)
save_file(label_encoder_path, label_encoder)


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Process the text column
    </div>

In [12]:
input_text = list(data[text_col_name])

In [13]:
len(input_text)

241440

In [14]:
## convert text to lower case 
input_text=[i.lower() for i in tqdm(input_text)]

## remove punctuations except apostrophe
input_text= [re.sub(r'\{\$\d+\.\d{2}\}',"",i) for i in tqdm(input_text)]
input_text = [re.sub("\d+", "", i) for i in tqdm(input_text)]

## remove more than one consecutive instance of x 
input_text=[re.sub(r'[x]{2,}',"",i) for i in tqdm(input_text)]

## remove multiple spaces with a single space 
input_text=[re.sub(r' +',' ',i) for i in tqdm(input_text)]

## remove '//
input_text=[re.sub(r'[//]','',i) for i in tqdm(input_text)]

100%|██████████| 241440/241440 [00:00<00:00, 789488.35it/s]
100%|██████████| 241440/241440 [00:00<00:00, 561876.15it/s]
100%|██████████| 241440/241440 [00:03<00:00, 69203.93it/s]
100%|██████████| 241440/241440 [00:03<00:00, 78467.44it/s]
100%|██████████| 241440/241440 [00:09<00:00, 25282.87it/s]
100%|██████████| 241440/241440 [00:00<00:00, 551770.24it/s]


In [15]:
## Tokenize text 
tokenizer=BertTokenizer.from_pretrained('bert-base-cased')

In [16]:
input_text[0]

'i contacted ally on friday  after falling behind on payments due to being out of work for a short period of time due to an illness. i chated with a representative after logging into my account regarding my opitions to ensure i protect my credit and bring my account current. \n\nshe advised me that before an extenstion could be done, i had to make a payment in the amount of . i reviewed my finances, as i am playing catch up on all my bills and made this payment on monday . this rep advised me, once this payment posts to my account to contact ally back for an extention or to have a payment deffered to the end of my loan. \n\nwith this in mind, i contacted ally again today and chatted with . i explained all of the above and the information i was provided when i chatted with the rep last week. she asked several questions and advised me that a one or two month extensiondeffered payment could be done however partial payment is needed! what? she advised me or there abouts would be due within

In [17]:
## a tokenization example
sample_tokens = tokenizer(input_text[0], padding=True,
                          max_length = seq_len, truncation = True,
                          return_tensors='pt')

In [18]:
sample_tokens

{'input_ids': tensor([[  101,   178, 12017, 11989,  1113,   175, 22977,  1183,  1170,  4058,
          1481,  1113, 10772,  1496,  1106,  1217,  1149,  1104,  1250,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [19]:
sample_tokens["input_ids"]

tensor([[  101,   178, 12017, 11989,  1113,   175, 22977,  1183,  1170,  4058,
          1481,  1113, 10772,  1496,  1106,  1217,  1149,  1104,  1250,   102]])

In [20]:
sample_tokens["attention_mask"]

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

In [21]:
## tokenization of all reviews in the data 
tokens = [tokenizer(i, padding="max_length", max_length=seq_len, 
                    truncation=True, return_tensors="pt") 
         for i in tqdm(input_text)]

100%|██████████| 241440/241440 [09:10<00:00, 438.43it/s]



<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Save tokens
    </div>


<div style="background-color:#D5D9F2; color:#19180F; font-size:20px; font-family:verdana; padding:10px; border: 5px solid #19180F; border-radius:10px "> 
Now that we have our tokens prepared as input for our model, I choose to save them on disk. This will be useful if I need to use them for another task with the same model (BERT) in the future. Or if some one needs to reproduce this project, he can just use my tokens directly (if he is using the same model of course). 
</div>




In [22]:
save_file(tokens_path, tokens)

<div style="background-color:#F1C40F; color:#C0392B; font-size:40px; font-family:Arial; padding:10px; border: 5px solid #19180F; border-radius:10px"> II. Create BERT Model</div>


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
ClassifierBert class 
    </div>

In [23]:
class BertClassifier(nn.Module):
    
    def __init__(self, dropout, num_classes):
        super(BertClassifier, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-cased')
        for param in self.bert.parameters():
            param.required_grad = False
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(768, num_classes)
        self.activation = nn.ReLU()
    
    def forward(self, input_ids, attention_mask):
        _, bert_output = self.bert(input_ids=input_ids,
                                  attention_mask=attention_mask,
                                  return_dict=False)
        dropout_output = self.activation(self.dropout(bert_output))
        final_output = self.linear(dropout_output)
        return final_output


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Pytorch Dataset
    </div>

In [24]:
class TextDataset(torch.utils.data.Dataset):
    
    def __init__(self, tokens, labels):
        self.tokens = tokens
        self.labels = labels
        
    def __len__(self):
        return len(self.tokens)
    
    def __getitem__(self, idx):
        return self.labels[idx], self.tokens[idx]


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Function to train the model
    </div>

In [36]:
import torch
import numpy as np
from tqdm import tqdm

def train(train_loader, valid_loader, model, criterion, optimizer, device, num_epochs, model_path):
    """
    Function to train the model
    :param train_loader: Data loader for train dataset
    :param valid_loader: Data loader for validation dataset
    :param model: Model object
    :param criterion: Loss function
    :param optimizer: Optimizer
    :param device: CUDA or CPU
    :param num_epochs: Number of epochs
    :param model_path: Path to save the model
    """
    # We initialize the best loss with a large value
    best_loss = 1e8

    for epoch in range(num_epochs):
        print(f"Epoch {epoch+1} of {num_epochs}")
        # For each epoch, create lists to store losses
        train_loss = []
        valid_loss = []
        
        model.train()
        
        # Train loop
        for batch_labels, batch_data in tqdm(train_loader):
            input_ids = batch_data["input_ids"]
            attention_mask = batch_data["attention_mask"]
            
            # Move data to GPU if available
            batch_labels = batch_labels.to(device)
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            input_ids = torch.squeeze(input_ids, 1)  # train_loader returns 3D tensors, we need 2D
            
            # Forward pass
            batch_output = model(input_ids, attention_mask)
            batch_output = torch.squeeze(batch_output)
            
            # Calculate loss
            loss = criterion(batch_output, batch_labels)
            
            # Add batch_loss to train_loss list
            train_loss.append(loss.item())
            
            # Prepare for backward pass
            optimizer.zero_grad()  # Reset the gradients for each batch
            
            # Backward pass
            loss.backward()
            
            # Gradient update step
            optimizer.step()
        
        model.eval()
        
        # Validation loop
        with torch.no_grad():
            for batch_labels, batch_data in tqdm(valid_loader):
                input_ids = batch_data["input_ids"]
                attention_mask = batch_data["attention_mask"]
                
                # Move data to GPU if available
                batch_labels = batch_labels.to(device)
                input_ids = input_ids.to(device)
                attention_mask = attention_mask.to(device)
                input_ids = torch.squeeze(input_ids, 1)
                
                # Forward pass
                batch_output = model(input_ids, attention_mask)
                batch_output = torch.squeeze(batch_output)
                
                # Calculate loss
                loss = criterion(batch_output, batch_labels)
                
                # Add batch_loss to valid_loss list
                valid_loss.append(loss.item())
        
        # Compute the mean of train & valid loss for the epoch
        t_loss = np.mean(train_loss)
        v_loss = np.mean(valid_loss)
        
        print(f"Train Loss: {t_loss}, Validation Loss: {v_loss}")
        
        # Check if this is the best validation loss we've seen so far
        if v_loss < best_loss:
            best_loss = v_loss
            # Save current model as the best model
            torch.save(model.state_dict(), model_path)
        
        print(f"Best Validation Loss: {best_loss}")



<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Function to test the model
    </div>

In [26]:
def test(test_loader, model, criterion, device):
    """
    Function to test the model
    :param test_loader: Data loader for test dataset
    :param model: Model object
    :param criterion: Loss function
    :param device: CUDA or CPU
    """
    model.eval()
    test_loss = []
    test_accu = []
    for batch_labels, batch_data in tqdm(test_loader):
        input_ids = batch_data["input_ids"]
        attention_mask = batch_data["attention_mask"]
        # Move data to GPU if available
        batch_labels = batch_labels.to(device)
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        input_ids = torch.squeeze(input_ids, 1)
        # Forward pass
        batch_output = model(input_ids, attention_mask)
        batch_output = torch.squeeze(batch_output)
        # Calculate loss
        ###batch_labels = batch_labels.type(torch.LongTensor)
        loss = criterion(batch_output, batch_labels)
        test_loss.append(loss.item())
        batch_preds = torch.argmax(batch_output, axis=1)
        # Move predictions to CPU
        if torch.cuda.is_available():
            batch_labels = batch_labels.cpu()
            batch_preds = batch_preds.cpu()
        # Compute accuracy
        test_accu.append(accuracy_score(batch_labels.detach().
                                        numpy(),
                                        batch_preds.detach().
                                        numpy()))
    test_loss = np.mean(test_loss)
    test_accu = np.mean(test_accu)
    print(f"Test Loss: {test_loss}, Test Accuracy: {test_accu}")

<div style="background-color:#F1C40F; color:#C0392B; font-size:40px; font-family:Arial; padding:10px; border: 5px solid #19180F; border-radius:10px"> III. Train The  Model</div>


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Load files    </div>

In [27]:
tokens = load_file(tokens_path)
labels = load_file(labels_path)
label_encoder = load_file(label_encoder_path)
num_classes = len(label_encoder.classes_)


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Split data into train, validatin and test     </div>

In [28]:
X_train, X_test, y_train, y_test = train_test_split(tokens, labels,
                                                   test_size=0.2)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, 
                                                      y_train,
                                                     test_size=0.25)


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Create Pytorch datasets     </div>

In [29]:
train_dataset = TextDataset(X_train, y_train)
valid_dataset = TextDataset(X_valid, y_valid)
test_dataset = TextDataset(X_test, y_test)


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Create Data Loaders    </div>

In [30]:
train_loader = torch.utils.data.DataLoader(train_dataset,
                                           batch_size=16,
                                           shuffle=True,
                                           drop_last=True)
valid_loader = torch.utils.data.DataLoader(valid_dataset,
                                           batch_size=16)
test_loader = torch.utils.data.DataLoader(test_dataset, 
                                         batch_size=16)


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
INstantiate model object    </div>

In [31]:
device = torch.device("cuda:0" if torch.cuda.is_available()
                     else "cpu")

In [32]:
model = BertClassifier(dropout, num_classes)


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Define loss function and optimizer    </div>

In [33]:
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Move the model to GPU if available    </div>

In [34]:
if torch.cuda.is_available():
    model = model.cuda()
    criterion = criterion.cuda()


<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Train loop   </div>

In [37]:
train(train_loader, valid_loader, model, criterion, optimizer,
     device, num_epochs, model_path)

Epoch 1 of 10


100%|██████████| 9054/9054 [05:02<00:00, 29.95it/s]
100%|██████████| 3018/3018 [00:23<00:00, 131.12it/s]


Train Loss: 1.5812722926164058, Validation Loss: 1.5352410966193464
Best Validation Loss: 1.5352410966193464
Epoch 2 of 10


100%|██████████| 9054/9054 [05:10<00:00, 29.21it/s]
100%|██████████| 3018/3018 [00:22<00:00, 131.31it/s]


Train Loss: 1.5798125248464074, Validation Loss: 1.55237182652895
Best Validation Loss: 1.5352410966193464
Epoch 3 of 10


100%|██████████| 9054/9054 [05:05<00:00, 29.68it/s]
100%|██████████| 3018/3018 [00:23<00:00, 131.20it/s]


Train Loss: 1.5830056611304517, Validation Loss: 1.5166558558337335
Best Validation Loss: 1.5166558558337335
Epoch 4 of 10


100%|██████████| 9054/9054 [05:11<00:00, 29.03it/s]
100%|██████████| 3018/3018 [00:23<00:00, 131.16it/s]


Train Loss: 1.5806448261864747, Validation Loss: 1.5286006583647669
Best Validation Loss: 1.5166558558337335
Epoch 5 of 10


100%|██████████| 9054/9054 [04:57<00:00, 30.44it/s]
100%|██████████| 3018/3018 [00:23<00:00, 131.19it/s]


Train Loss: 1.5803480836922217, Validation Loss: 1.5645999267766457
Best Validation Loss: 1.5166558558337335
Epoch 6 of 10


100%|██████████| 9054/9054 [04:58<00:00, 30.32it/s]
100%|██████████| 3018/3018 [00:22<00:00, 131.26it/s]


Train Loss: 1.5809528193964661, Validation Loss: 1.5277305251719384
Best Validation Loss: 1.5166558558337335
Epoch 7 of 10


100%|██████████| 9054/9054 [05:00<00:00, 30.10it/s]
100%|██████████| 3018/3018 [00:23<00:00, 131.14it/s]


Train Loss: 1.5815212822115814, Validation Loss: 1.5313032572438496
Best Validation Loss: 1.5166558558337335
Epoch 8 of 10


100%|██████████| 9054/9054 [05:02<00:00, 29.95it/s]
100%|██████████| 3018/3018 [00:22<00:00, 131.41it/s]


Train Loss: 1.5795763257997095, Validation Loss: 1.5868910046465907
Best Validation Loss: 1.5166558558337335
Epoch 9 of 10


100%|██████████| 9054/9054 [05:04<00:00, 29.71it/s]
100%|██████████| 3018/3018 [00:22<00:00, 131.55it/s]


Train Loss: 1.581666468895901, Validation Loss: 1.5503532421004307
Best Validation Loss: 1.5166558558337335
Epoch 10 of 10


100%|██████████| 9054/9054 [05:09<00:00, 29.21it/s]
100%|██████████| 3018/3018 [00:22<00:00, 131.26it/s]

Train Loss: 1.5813977529150987, Validation Loss: 1.5894147534296164
Best Validation Loss: 1.5166558558337335






<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Test loop </div>

In [38]:
test(test_loader, model, criterion, device)

100%|██████████| 3018/3018 [00:27<00:00, 111.72it/s]

Test Loss: 1.6000038638385106, Test Accuracy: 0.521392478462558






<div style="background-color:#F0E3D2; color:#19180F; font-size:15px; font-family:Verdana; padding:10px; border: 2px solid #19180F; border-radius:10px"> 
📌
Predict on new text </div>

In [39]:
input_text = '''I am a victim of Identity Theft & currently have an Experian account that 
I can view my Experian Credit Report and getting notified when there is activity on 
my Experian Credit Report. For the past 3 days I've spent a total of approximately 9 
hours on the phone with Experian. Every time I call I get transferred repeatedly and 
then my last transfer and automated message states to press 1 and leave a message and 
someone would call me. Every time I press 1 I get an automatic message stating than you 
before I even leave a message and get disconnected. I call Experian again, explain what 
is happening and the process begins again with the same end result. I was trying to have 
this issue attended and resolved informally but I give up after 9 hours. There are hard 
hit inquiries on my Experian Credit Report that are fraud, I didn't authorize, or recall 
and I respectfully request that Experian remove the hard hit inquiries immediately just 
like they've done in the past when I was able to speak to a live Experian representative 
in the United States. The following are the hard hit inquiries : BK OF XXXX XX/XX/XXXX 
XXXX XXXX XXXX  XX/XX/XXXX XXXX  XXXX XXXX  XX/XX/XXXX XXXX  XX/XX/XXXX XXXX  XXXX 
XX/XX/XXXX'''

In [40]:
input_text = input_text.lower()
input_text = re.sub(r"[^\w\d'\s]+", " ", input_text)
input_text = re.sub("\d+", "", input_text)
input_text = re.sub(r'[x]{2,}', "", input_text)
input_text = re.sub(' +', ' ', input_text)

# use the same tokenizer 
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

# encode text with the tokenizer 
tokens = tokenizer(input_text, padding="max_length",
                 max_length=seq_len, truncation=True,
                 return_tensors="pt")

input_ids = tokens["input_ids"]
attention_mask = tokens["attention_mask"]

In [41]:
device = torch.device("cuda:0" if torch.cuda.is_available()
                     else "cpu")

In [42]:
input_ids = input_ids.to(device)
attention_mask = attention_mask.to(device)

In [43]:
input_ids = torch.squeeze(input_ids, 1)

In [44]:
label_encoder = load_file(label_encoder_path)
num_classes = len(label_encoder.classes_)

In [45]:
# Create model object
model = BertClassifier(dropout, num_classes)

# Load trained weights
model.load_state_dict(torch.load(model_path))

# Move the model to GPU if available
if torch.cuda.is_available():
    model = model.cuda()
    
# Forward pass
out = torch.squeeze(model(input_ids, attention_mask))

# Find predicted class
prediction = label_encoder.classes_[torch.argmax(out)]
print(f"Predicted Class: {prediction}")

Predicted Class: credit_report
