CNARS/P: A tool for predicting job types in survey narratives using BERT models   

This notebook builds a pipeline that fine-tunes BERT models on labeled data and predicts classes for unlabeled data, for job type classification.
This project was supported by a grant from the National Science Foundation: FW-HTF-P Understanding Gig Work and its Effects on Wellbeing over the Life Course in the United States: A Machine Learning Approach (PI - Dr. Joelle Abramowitz; Co-PI - Dr. Jinseok Kim; 2021-10 to 2023-09)


Authors: Jinseok Kim (University of Michigan) & Jenna Kim (University of Illinois at Urbana-Champaign)   
Created: 2022/6/14  
Last Modified: 2023/10/08  

Updates:  
* Download pretrained model (& tokenizer) using transformers API and save into local directory  
* Load pretrained model from local directory for training process  
* Save the best trained model (with highest validation accuracy) to the same directory  
* Load the best trained model locally for testing process  
* Train once, then reuse the best model as many times as needed for testing (labeled data vs unlabeled data)   
* Add different pretrained models: RoBERTa & BERTweet  
* Remove the function for data size change with ratio  

References:  
* https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/  
* https://www.youtube.com/watch?v=f-86-HcYYi8  
* https://mccormickml.com/2019/07/22/BERT-fine-tuning/  
* https://www.analyticsvidhya.com/blog/2021/12/multiclass-classification-using-transformers/

# 1. Setup

## 1-1. Install packages and load libraries

Install the transformers package from Hugging Face, which provides a PyTorch interface for working with BERT.

In [None]:
# transformer ver: 4.15.0
#!pip install transformers==4.15.0

In [None]:
#!pip install imblearn

In [None]:
# install PyTorch
# Note: No need to install PyTorch if this notebook runs on AWS SageMaker with a PyTorch kernel

#!pip install torch==1.5.0

In [None]:
# Check if the packages are correctly installed
#!pip list

Load other libraries

In [None]:
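import timeit
import torch
import transformers

# Names used unqualified later in this notebook
# (AdamW here is the transformers implementation, which accepts correct_bias)
from transformers import AutoTokenizer, AutoModel, AdamW, get_linear_schedule_with_warmup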

import os

import numpy as np
import pandas as pd
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report

from collections import defaultdict
from textwrap import wrap

from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F


In [None]:
# Set up for plots and parameters

%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set(style='darkgrid', palette='muted', font_scale=1.5)
COLORS_PALETTE = ["#01BEFE", "#FFDD00", "#FF7D00", "#FF006D", "#ADFF02", "#8F00FF"]
sns.set_palette(sns.color_palette(COLORS_PALETTE))
rcParams["figure.figsize"] = (12, 6)

In [None]:
# Hide warning messages from display
import warnings
warnings.filterwarnings('ignore')
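
The functions in section 6 load the tokenizer and model from a local directory (`model_download_path`), as described in the update notes at the top. A minimal sketch of how such a directory might be prepared with the transformers API follows. The helper name `download_pretrained`, the directory layout, and the model identifier in the usage comment are assumptions for illustration, not part of the original pipeline.

```python
import os


def download_pretrained(model_name, save_dir):
    """Download a pretrained tokenizer and model, then save both to save_dir.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    # Imported here so that defining the helper does not require transformers
    from transformers import AutoModel, AutoTokenizer

    os.makedirs(save_dir, exist_ok=True)

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # save_pretrained writes the tokenizer files, config, and weights into save_dir
    tokenizer.save_pretrained(save_dir)
    model.save_pretrained(save_dir)

    return save_dir


# Example usage (requires internet access on the first run):
# model_download_path = download_pretrained("bert-base-uncased", "./pretrained/bert-base-uncased")
# tokenizer = AutoTokenizer.from_pretrained(model_download_path)
```

Once the directory exists, later runs can load from it without contacting the Hugging Face Hub.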

## 1-2. Check GPU for training

Note: If you use Google Colab, make sure the runtime type is set to GPU before running the next cell: go to Runtime => Change runtime type and select GPU as the hardware accelerator.

In [None]:
# Check a version of CUDA
!nvcc --version

In [None]:
# Check if there's a GPU available
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU    
    device = torch.device("cuda")

    print('There are {:d} GPU(s) available.'.format(torch.cuda.device_count()))
    print('We will use the GPU: ', torch.cuda.get_device_name(0))

else:
    device = torch.device("cpu")
    
    print('No GPU available, using the CPU instead.')

In [None]:
#Additional Info when using cuda
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    # memory_cached() is deprecated; memory_reserved() reports the same quantity
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')

In [None]:
# check GPU memory and utilization
!nvidia-smi

# To check the GPU memory usage while the process is running
# open a terminal in the directory (Go to New-> Terminal) and type the above code

In [None]:
# clear the occupied cuda memory for efficient use
import gc

gc.collect()
torch.cuda.empty_cache()

# Kill a running process if more GPU space is needed
#!sudo kill -9 3320

# 2. Load data

## 2-1. If you load data from Google Drive directory

In [None]:
#import os
#from google.colab import drive
#drive.mount('/gdrive')
#%cd /gdrive

When running the above code, you might be asked to enter an authorization code to connect to your Google Drive folder.

In [None]:
# Access the directory where a dataset is stored
#os.listdir("/gdrive/My Drive/Colab Notebooks/LabelingProject")

In [None]:
#path = "/gdrive/My Drive/Colab Notebooks/LabelingProject"

In [None]:
# Load the dataset into a pandas dataframe
# IMDB dataset: similar to Medline data
# Required for reducing the data size due to the GPU constraints 
#df_raw = pd.read_csv(os.path.join(path, "IMDB Dataset.csv"))  

# CoLA dataset: one sentence per each instance
#df1 = pd.read_csv(os.path.join(path, "in_domain_train.tsv"), delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])
#df2 = pd.read_csv(os.path.join(path, "in_domain_dev.tsv"), delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

#print(df1.shape)
#print(df2.shape)

#df_raw = pd.concat([df1, df2], ignore_index=True)

#print(df_raw.shape)

## 2-2. If you load data from AWS SageMaker or your local directory

In [None]:
def load_data(filename, colname, record):
    
    """
    Read in input file and load data
    
    filename: csv file
    colname: column to use as input text ("text", "category", or "mix")
    record: text file to save summary

    return: dataframe
    
    """
    ## 1. Read in data from csv file
    #df = pd.read_csv(filename, encoding="utf-8", engine='python')
    
    # If unicodedecode error appears, use one of below options
    # 1. Save dataset in utf-8 format: open csv file (file->save as-> CSV UTF-8)
    # 2. Change encoding to 'unicode-escape' 
    df = pd.read_csv(filename, encoding="unicode-escape")
    
    ## 2. No of rows and columns & data view
    print("No of Rows (Raw data): {}".format(df.shape[0]), file=record)
    print("No of Columns: {}".format(df.shape[1]), file=record)    
    print("No of Rows (Raw data): {}".format(df.shape[0]))
    print("No of Columns: {}".format(df.shape[1]))

    print("\n<Data View (Raw data)>\n{}".format(df.head(15)), file=record)
    print("\n<Data View (Raw data)>\n{}".format(df.head(15)))
    
    ## 3. Replace null values in any rows
    # 3-1. Identify columns with null values
    print("\nCheck if null value exists:\n")
    print(df.info())
    print("\nCheck if null value exists:\n", file=record)
    df.info(buf=record)

    # 3-2. Replace empty values with numpy NaN
    df = df.replace(r'^\s*$', np.nan, regex=True)
    
    # 3-3. Drop rows if both text and category columns have null values
    df.dropna(how='all', subset=['text', 'category'], inplace = True)

    print("\nNo of rows (After handling null values): {}".format(df.shape[0]), file=record)
    print("No of columns: {}".format(df.shape[1]), file=record)  
    print("\nNo of rows (After handling null values): {}".format(df.shape[0]))
    print("No of columns: {}".format(df.shape[1]))

    print("\n<Data View (Null handled)>\n{}".format(df.head(15)), file=record)
    print("\n<Data View (Null handled)>\n{}".format(df.head(15)))
        
    ## 4. Select columns for processing
    # if unlabeled data, fill in with a proxy number
    if 'label' not in df.columns:
      df['label']=[111 for i in range(df.shape[0])]
    
    # Select columns based on given column name
    # Output columns are always ordered as: instanceid, sentence, label
    # (the rest of the pipeline reads 'instanceid' and 'label', and sampling
    #  treats the last column as the target)
    if colname == "text":
      df = df[['instanceid', 'text', 'label']].copy()
      df.rename({"text": "sentence"}, axis=1, inplace=True)
    elif colname == "category":
      df = df[['instanceid', 'category', 'label']].copy()
      df['category'] = df['category'].astype(str)
      df.rename({"category": "sentence"}, axis=1, inplace=True)
    elif colname == "mix":
      df['mix'] = df['text'].astype(str) + ' ' + df['category'].astype(str)
      df = df[['instanceid', 'mix', 'label']].copy()
      df.rename({"mix": "sentence"}, axis=1, inplace=True)
    
    ## 5. Check the data
    print("\n<Data View: Selected Input>\n{}".format(df.head()), file=record)
    print("\n<Data View: Selected Input>\n{}".format(df.head()))
    
    print('\nClass Counts(label, row): Total', file=record)
    print(df.label.value_counts(), file=record)  
    print('\nClass Counts(label, row): Total')
    print(df.label.value_counts())
     
    return df

# 3. Data Processing

## 3-1. Check the distribution of token length

In [None]:
def token_distribution(df, tokenizer, record):
    token_lens = []
    long_tokens = []

    # remove null values
    df = df.dropna()
    
    for id, txt in zip(df.instanceid, df.sentence):
        tokens = tokenizer.encode(txt, padding=True, truncation=True, max_length=512)
        token_lens.append(len(tokens))
    
        # Check a sentence with extreme length
        if len(tokens) > 150:
            long_tokens.append((id, len(tokens)))   
  
    print("\n********** Check if Long Sentence Exists **********\n", file=record)
    print("\n********** Check if Long Sentence Exists **********\n")

    if len(long_tokens)>0:
      print(long_tokens, file=record) 
      print(long_tokens) 
    else:
      print("No long sentence (<=150 tokens)", file=record)
      print("No long sentence (<=150 tokens)")
    
    print("\nMin token:", min(token_lens), file=record)
    print("Max token:", max(token_lens), file=record)
    print("Avg token:", round(sum(token_lens)/len(token_lens)), file=record)
    print("\nMin token:", min(token_lens))
    print("Max token:", max(token_lens))
    print("Avg token:", round(sum(token_lens)/len(token_lens)))
    
    # plot the distribution
    sns.displot(token_lens)
    plt.xlim([0, max(token_lens)+10])
    plt.xlabel("Token Count")

## 3-2. Create a PyTorch dataset

In [None]:
class LabelDataset(Dataset):
    def __init__(self, reviews, targets, tokenizer, max_len):
        self.reviews = reviews
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.reviews)

    def __getitem__(self, item):
        review = str(self.reviews[item])
        review = " ".join(review.split())
        target = self.targets[item]

        encoding = self.tokenizer.encode_plus(
            review,
            None,                                    # second parameter is needed for a task of sentence similarity
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_token_type_ids=True,  
            return_attention_mask=True,
            return_tensors='pt')

        return {
            'texts': review,
            'input_ids': encoding['input_ids'].flatten(),            # flatten() reduce dimension: e.g., [1, 512] -> [512]
            'token_type_ids': encoding['token_type_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'targets': torch.tensor(target, dtype=torch.long)
        }

## 3-3. Sampling

In [None]:
def sample_data(X_train, y_train, sampling=0, sample_method='over'):
    """
       Sample the input train data
       
       X_train: dataframe of X train data
       y_train: dataframe of y train data
       sampling: indicator of whether sampling is on or off
       sample_method: method of sampling ('over' for oversampling or 'under' for undersampling)
       
    """
    
    from imblearn.over_sampling import RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler
    
    if sampling:
        if sample_method == 'over':
            oversample = RandomOverSampler(random_state=42)
            X_over, y_over = oversample.fit_resample(X_train, y_train)
            print('\n****** Data Sampling ******')
            print('\nOversampled Data (class, Rows):\n{}'.format(y_over.value_counts()))
            X_train_sam, y_train_sam = X_over, y_over
            
        elif sample_method == 'under':
            undersample = RandomUnderSampler(random_state=42)
            X_under, y_under = undersample.fit_resample(X_train, y_train)
            print('\n****** Data Sampling ******')
            print('\nUndersampled Data (class,Rows):\n{}'.format(y_under.value_counts()))
            X_train_sam, y_train_sam = X_under, y_under
    else:
        X_train_sam, y_train_sam = X_train, y_train 
        print('\n****** Data Sampling ******')
        print('\nNo Sampling Performed\n')
    
    return X_train_sam, y_train_sam
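
To make the effect of random oversampling concrete, the sketch below duplicates randomly chosen rows of the smaller classes until every class matches the largest one. It uses only the standard library and is not imblearn's implementation; `random_oversample` and the toy rows and labels are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of smaller classes until all classes
    have as many rows as the largest class (illustrative sketch only)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Rows belonging to this class in the original data
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw with replacement to fill the gap to the largest class
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


rows = ["drive for a rideshare app", "teach at a school", "nurse", "deliver food"]
labels = [1, 0, 0, 0]
new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

Random undersampling is the mirror image: rows of the larger classes are dropped at random until they match the smallest class.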

## 3-4. Create a data loader & classifier

In [None]:
def create_data_loader(df, tokenizer, max_len, batch_size):
    ds = LabelDataset(
        reviews = df.sentence.to_numpy(),
        targets = df.label.to_numpy(),
        tokenizer = tokenizer,
        max_len = max_len
    )
    
    return DataLoader(
        ds,
        batch_size = batch_size,
        num_workers = 1)

In [None]:
class LabelClassifier(nn.Module):
    
    def __init__(self, n_classes, model_loaded):
        super(LabelClassifier, self).__init__()
        self.bert = model_loaded
        self.dropout = nn.Dropout(p=0.3)
        self.linear = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask, token_type_ids):
        bert_out = self.bert(
            input_ids = input_ids,
            attention_mask = attention_mask,
            token_type_ids = token_type_ids)
        output_dropout = self.dropout(bert_out.pooler_output)
        output = self.linear(output_dropout)
    
        return output

# 4. Training & Validation

The BERT authors' recommendations for fine-tuning:  
* Batch size: 16, 32  
* Learning rate (Adam): 5e-5, 3e-5, 2e-5  
* Number of epochs: 2, 3, 4
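
These recommended values define a small search grid. As an illustration, the sketch below enumerates every combination with `itertools.product`. The dictionary keys are hypothetical names, and how each configuration is passed to the training functions is left open.

```python
from itertools import product

# Values recommended by the BERT authors for fine-tuning
batch_sizes = [16, 32]
learning_rates = [5e-5, 3e-5, 2e-5]
epoch_counts = [2, 3, 4]

# One dict per configuration (key names are illustrative)
grid = [
    {"batch_size": b, "learning_rate": lr, "epochs": e}
    for b, lr, e in product(batch_sizes, learning_rates, epoch_counts)
]

print(len(grid))   # 2 * 3 * 3 = 18 configurations
print(grid[0])
```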

In [None]:
def train_model(
    model,
    data_loader,
    loss_fn,
    optimizer,
    device,
    scheduler,
    n_examples,
    outfile):
    
    model = model.train()
    
    losses = []
    correct_predictions = 0

    for d in data_loader:
        input_ids = d["input_ids"].to(device, dtype=torch.long)
        attention_mask = d["attention_mask"].to(device, dtype=torch.long)
        token_type_ids = d["token_type_ids"].to(device, dtype=torch.long)
        targets = d["targets"].to(device)

        outputs = model(
            input_ids = input_ids,
            attention_mask = attention_mask,
            token_type_ids=token_type_ids
        )

        _, preds = torch.max(outputs, dim=1)
        loss = loss_fn(outputs, targets)

        # printout for checking the prediction & target
        #print("Pred: ", preds)
        #print("Target: ", targets)

        correct_predictions += torch.sum(preds == targets)
        losses.append(loss.item())

        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        
    print("Correct Prediction (Train): {} out of {}".format(correct_predictions.int(), n_examples), file=outfile)
    print("Correct Prediction (Train): {} out of {}".format(correct_predictions.int(), n_examples))

    return correct_predictions.double() / n_examples, np.mean(losses)

In [None]:
def eval_model(
    model, 
    data_loader, 
    loss_fn, 
    device, 
    n_examples,
    outfile):
    
    model = model.eval()

    losses = []
    correct_predictions = 0

    with torch.no_grad():
        for d in data_loader:
            input_ids = d["input_ids"].to(device, dtype=torch.long)
            attention_mask = d["attention_mask"].to(device, dtype=torch.long)
            token_type_ids = d["token_type_ids"].to(device, dtype=torch.long)
            targets = d["targets"].to(device)

            outputs = model(
                input_ids = input_ids,
                attention_mask = attention_mask,
                token_type_ids = token_type_ids
                )
            
            _, preds = torch.max(outputs, dim=1)
            loss = loss_fn(outputs, targets)

            correct_predictions += torch.sum(preds == targets)
            losses.append(loss.item())
    
    print("Correct Prediction (Eval): {} out of {}".format(correct_predictions.int(), n_examples), file=outfile)
    print("Correct Prediction (Eval): {} out of {}".format(correct_predictions.int(), n_examples))
    
    return correct_predictions.double()/n_examples, np.mean(losses)

In [None]:
def plot_train_history(history, modelname):
  
  plt.figure(figsize=(8, 6))
  
  # detach to cpu and convert float
  train_acc = [float(x.to("cpu").numpy()) for x in history["train_acc"]]
  val_acc = [float(x.to("cpu").numpy()) for x in history["val_acc"]]
  
  # plot
  plt.plot(train_acc, 'b-o', label="train accuracy")
  plt.plot(val_acc, 'r-o', label="validation accuracy")
  plt.title("Training History")
  plt.ylabel("Accuracy")
  plt.xlabel("Epoch")
  plt.legend(loc='lower right', fontsize=15)
  plt.xticks(history["epoch"])
  plt.yticks(np.arange(0,1.2,step=0.1))
  plt.ylim([0,1.02])

  # save to file
  #filename = "trainingplot_" + modelname + ".png"
  #plt.savefig(filename)

  plt.show()

In [None]:
def training_loop(epochs, 
                  modelname, 
                  model, 
                  train_data_loader, 
                  val_data_loader, 
                  loss_fn, 
                  optimizer, 
                  device, 
                  scheduler, 
                  n_train, 
                  n_val,
                  model_file,
                  record):
    
    print("\n**** Model Name: " + modelname + " *****", file=record)
    print("\n**** Model Name: " + modelname + " *****")
    
    history = defaultdict(list)
    best_accuracy = 0

    for epoch in range(epochs):
        print("\nEpoch {} / {}".format(str(epoch + 1), str(epochs)), file=record)
        print("-" * 60, file=record)
    
        print("\nEpoch {} / {}".format(str(epoch + 1), str(epochs)))
        print("-" * 60)
    
        train_acc, train_loss = train_model(
            model, 
            train_data_loader,
            loss_fn,
            optimizer,
            device,
            scheduler,
            n_train,
            outfile=record)
    
        print("Train Loss: {}, Accuracy: {}\n".format(train_loss, train_acc), file=record)
        print("Train Loss: {}, Accuracy: {}\n".format(train_loss, train_acc))
    
        val_acc, val_loss = eval_model(
            model,
            val_data_loader,
            loss_fn,
            device,
            n_val,
            outfile=record)
    
        print("Validation Loss: {}, Accuracy: {}".format(val_loss, val_acc), file=record)  
        print("Validation Loss: {}, Accuracy: {}".format(val_loss, val_acc))

        # store model state while training
        history["epoch"].append(epoch)
        history["train_acc"].append(train_acc)
        history["train_loss"].append(train_loss)
        history["val_acc"].append(val_acc)
        history["val_loss"].append(val_loss)
        
        # save the best performing model
        if val_acc > best_accuracy:
            if model_file:
              torch.save(model.state_dict(), model_file)

              # for checking only: model's state_dict 
              #print("\nModel's state_dict:\n")
              #for param_tensor in model.state_dict():
              #  print(param_tensor, "\t", model.state_dict()[param_tensor].size())
            
            best_accuracy = val_acc
    
    # Plot training & validation accuracy
    plot_train_history(history, modelname)

# 5. Prediction & Evaluation

In [None]:
def get_predictions(model, data_loader):
    
    model = model.eval()
    
    review_texts = []
    predictions = []
    prediction_probs = []
    real_values = []
    
    with torch.no_grad():
        for d in data_loader:
            texts = d["texts"]
            input_ids = d["input_ids"].to(device, dtype=torch.long)
            attention_mask = d["attention_mask"].to(device, dtype=torch.long)
            token_type_ids = d["token_type_ids"].to(device, dtype=torch.long)
            targets = d["targets"].to(device)
            
            outputs = model(
                input_ids = input_ids,
                attention_mask = attention_mask,
                token_type_ids = token_type_ids
            )
            
            _, preds = torch.max(outputs, dim=1)

            # Apply the softmax or sigmoid function to normalize the raw output(logits) to get probability for each class
            probs = F.softmax(outputs, dim=1)
            
            review_texts.extend(texts)
            predictions.extend(preds)
            prediction_probs.extend(probs)
            real_values.extend(targets)

    # move the data to cpu
    predictions = torch.stack(predictions).cpu()
    prediction_probs = torch.stack(prediction_probs).cpu().detach().numpy()
    real_values = torch.stack(real_values).cpu()

    return review_texts, predictions, prediction_probs, real_values
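
`get_predictions` converts the raw model outputs (logits) into class probabilities with softmax. As a small worked example outside PyTorch, the sketch below applies the same formula to one row of hypothetical logits using only the standard library.

```python
import math


def softmax(logits):
    """Softmax over a list of logits, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]          # hypothetical scores for three classes
probs = softmax(logits)
pred = probs.index(max(probs))    # same class as the argmax of the logits

print([round(p, 3) for p in probs])
print(pred)
```

Because softmax preserves the ordering of the logits, the predicted class is the same whether it is taken from the logits or from the probabilities.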

In [None]:
def evaluate_model(y_test, y_pred, record, eval_model=0):
    """
      Evaluate model performance
      
      y_test: y test data
      y_pred: predicted y values
      eval_model: indicator if this function is on or off
      
    """
    
    if eval_model:
        
        print('\n************** Model Evaluation **************', file=record)
        print('\n************** Model Evaluation **************')
        
        print('\nConfusion Matrix:\n', file=record)
        print('\nConfusion Matrix:\n')
        print(confusion_matrix(y_test, y_pred), file=record)
        print(confusion_matrix(y_test, y_pred))
        
        print('\nClassification Report:\n', file=record)
        print('\nClassification Report:\n')
        print(classification_report(y_test, y_pred, digits=4), file=record)
        print(classification_report(y_test, y_pred, digits=4)) 

In [None]:
def predict_proba(df_test, y_text, y_test, y_pred, y_pred_probs, n_class, proba_file, test_unlabel_on=0, proba_out=0):
    
    """
       Get probability of each class
       
       df_test: original X test data
       y_text: text data sentence
       y_test: original y values
       y_pred: predicted y values
       y_pred_probs: probability scores of prediction
       n_class: number of label class
       proba_file: output file of probability scores
       test_unlabel_on: on(1) and off(0) for using unlabeled test data
       proba_out: on (1) and off (0) for probability output
       
    """
    if proba_out:
       
        prob_dict = {}
        for i in range(n_class):
          key_str = 'proba_' + str(i)
          value_str = y_pred_probs[:, i]
          prob_dict[key_str] = value_str
        
        df_proba = pd.DataFrame(prob_dict)
        
        if test_unlabel_on:
          y_test = df_test["label"].replace(111, "NA")

        df_pred = pd.DataFrame({'instanceid': df_test["instanceid"],
                                'input': y_text,
                                'act': y_test,
                                'pred': y_pred})
        
        df_result = pd.concat([df_pred, df_proba], axis=1)
        
        ## Save output
        df_result.to_csv(proba_file, encoding='utf-8', header=True, index=False)

# 6. Incorporating into main function

In [None]:
def train_test_labeled(input_file, colname, sample_on, sample_type, max_len, 
                       batch_size, modelname, model_download_path, device,
                       learning_rate, epochs, model_file_path, eval_on, 
                       proba_on, proba_file, result_file, record):
    
    """
       Training and testing using labeled data
       
       input_file: input file
       colname: column name to use as input text ("text", "category", or "mix")
       sample_on: indicator of sampling on or off
       sample_type: sample type to choose if sample_on is 1
       modelname: name of the pretrained model to be fine-tuned
       eval_on: indicator of model evaluation on or off
       proba_file: name of output file of probability
       result_file: name of output file of evaluation
       
    """
    
    ## 1. Load data
   
    print("\n************** Loading Data **************\n", file=record)
    print("\n************** Loading Data **************\n")
    df = load_data(input_file, colname, record=record)
    
    # Number of label class
    n_class = len(df['label'].value_counts().keys().tolist())
    print("\nNumber of label class: ", n_class, file=record)
    print("\nNumber of label class: ", n_class)

    # Check the first instance of input data
    # Use positional access: index label 0 may be gone after dropping null rows
    print("First Sentence: \n", df.sentence.iloc[0], file=record)
    print("First Sentence: \n", df.sentence.iloc[0])
    
    ## 2. Train and test split
    
    print("\n************** Splitting Data **************\n", file=record)
    print("\n************** Splitting Data **************\n")
    
    df_train, df_test = train_test_split(df, test_size=0.2, random_state=42, stratify=df.label)
    df_val, df_test = train_test_split(df_test, test_size=0.5, random_state=42, stratify=df_test.label)
    
    print("Train Data: {}".format(df_train.shape), file=record)
    print("Val Data: {}".format(df_val.shape), file=record)
    print("Test Data: {}".format(df_test.shape), file=record)
    
    print("Train Data: {}".format(df_train.shape))
    print("Val Data: {}".format(df_val.shape))
    print("Test Data: {}".format(df_test.shape))
    
    print('\nClass Counts(label, row): Train', file=record)
    print(df_train.label.value_counts(), file=record)
    print('\nClass Counts(label, row): Val', file=record)
    print(df_val.label.value_counts(), file=record)
    print('\nClass Counts(label, row): Test', file=record)
    print(df_test.label.value_counts(), file=record)
    
    print("\nTest Data")
    print(df_test.head())
    
    # Reset index
    df_train=df_train.reset_index(drop=True)
    df_val=df_val.reset_index(drop=True)
    df_test=df_test.reset_index(drop=True)
    
    print("\n************** Processing Data **************", file=record)
    print("\n************** Processing Data **************")
    print("Train Data: {}".format(df_train.shape), file=record)
    print("Val Data: {}".format(df_val.shape), file=record)
    print("Test Data: {}".format(df_test.shape), file=record)
    
    print("Train Data: {}".format(df_train.shape))
    print("Val Data: {}".format(df_val.shape))
    print("Test Data: {}".format(df_test.shape))
    
    print('\nClass Counts(label, row): Train', file=record)
    print(df_train.label.value_counts(), file=record)
    print('\nClass Counts(label, row): Val', file=record)
    print(df_val.label.value_counts(), file=record)
    print('\nClass Counts(label, row): Test', file=record)
    print(df_test.label.value_counts(), file=record)
    
    print("\nTest Data")
    print(df_test.head())
    
    ## 4. Sampling
    if sample_on:
        X_train = df_train.iloc[:, :-1]
        y_train = df_train.iloc[:, -1]
    
        # Sampling
        X_train_samp, y_train_samp = sample_data(X_train, y_train, sampling=sample_on, sample_method=sample_type)
    
        print(y_train_samp.value_counts(), file=record)

        # Combine x_train and y_train data
        df_train_concat = pd.concat([X_train_samp, y_train_samp], axis=1)

        print(df_train_concat.info())
        print(df_train_concat.head())
    
        # replace train data with sampled data
        df_train = df_train_concat
        print(df_train.shape)
    
    ## 5. Load data
    # Load downloaded pretrained tokenizer from local directory
    tokenizer = AutoTokenizer.from_pretrained(model_download_path)

    # Check token distribution for defining MAX_LEN value
    print("\n************** Token Distribution of Train Data **************")
    token_distribution(df_train, tokenizer, record=record)

    train_data_loader = create_data_loader(df_train, tokenizer, max_len, batch_size)
    val_data_loader = create_data_loader(df_val, tokenizer, max_len, batch_size)
    test_data_loader = create_data_loader(df_test, tokenizer, max_len, batch_size)

    ## 6. Model Training
    print("\n************** Training Model: " + modelname + " **************", file=record)
    print("\n************** Training Model: " + modelname + " **************")
    
    # Check training time
    start_time = timeit.default_timer()
    
    n_train = len(df_train)    
    n_val = len(df_val)
    
    # Load downloaded pretrained model from local directory and move it to GPU
    model_loaded = AutoModel.from_pretrained(model_download_path) 
    model = LabelClassifier(n_class, model_loaded)
    model = model.to(device)   
    
    # define Optimizer & scheduler & loss function
    optimizer = AdamW(model.parameters(), lr=learning_rate, correct_bias=False)
    total_steps = len(train_data_loader) * epochs

    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps = 0,
        num_training_steps = total_steps)

    loss_fn = nn.CrossEntropyLoss().to(device)
    
    # Loop training with epochs
    training_loop(epochs, modelname, model, train_data_loader, val_data_loader, 
                  loss_fn, optimizer, device, scheduler, n_train, n_val, 
                  model_file=model_file_path, record=record)
    
    elapsed = timeit.default_timer() - start_time
    print("\nTraining Time ({} epochs): {}".format(epochs, round(elapsed, 2)), file=record)
    print("\nTraining Time ({} epochs): {}".format(epochs, round(elapsed,2)))
    
    ## 7. Prediction  
    print("\n************** Getting Predictions **************", file=record)
    print("\n************** Getting Predictions **************")

    # Load the best trained model 
    model = LabelClassifier(n_class, model_loaded)
    model.load_state_dict(torch.load(model_file_path))
    model = model.to(device)   
    
    # Get predictions
    y_text, y_pred, y_pred_probs, y_test = get_predictions(model, test_data_loader) 
    
    ## 8. Evaluating model performance      
    print("\n************** Evaluating Model Performance **************", file=record)
    print("\n************** Evaluating Model Performance **************")    
    evaluate_model(y_test, y_pred, record=record, eval_model=eval_on)
        
    ## 9. Probability prediction   
    print("\n************** Creating Probability File **************", file=record)
    print("\n************** Creating Probability File **************") 
    # Labeled test data is used here, so the unlabeled-data flag is switched off
    predict_proba(df_test, y_text, y_test, y_pred, y_pred_probs, n_class, proba_file=proba_file, test_unlabel_on=0, proba_out=proba_on)
    
    print("\nOutput file: '" + result_file + "' Created", file=record)
    print("\nOutput file: '" + result_file + "' Created")
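
`train_test_labeled` sets `total_steps = len(train_data_loader) * epochs` and uses a linear schedule with zero warmup steps, so the learning rate decays linearly from its initial value to zero over training. The sketch below reproduces that multiplier in plain Python to show the shape of the schedule. `linear_lr_factor` is a hypothetical name, and the step counts and learning rate are made-up numbers.

```python
def linear_lr_factor(step, num_training_steps, num_warmup_steps=0):
    """Multiplier applied to the initial learning rate at a given step:
    linear warmup from 0 to 1, then linear decay from 1 to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Example: 50 batches per epoch for 4 epochs, initial learning rate 2e-5
total_steps = 50 * 4
base_lr = 2e-5

for step in (0, 50, 100, 150, 200):
    print(step, base_lr * linear_lr_factor(step, total_steps))
```

With zero warmup the factor starts at 1 on the first step, which is why the scheduler is stepped once per batch in `train_model` rather than once per epoch.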

In [None]:
def test_unlabeled(test_file, colname, max_len, batch_size, device, n_class,
                   model_download_path, modelname, model_file_path, proba_on, 
                   test_unlabel_on, result_file, proba_file, record):
    """
       Predict label for unlabeled test data and get probability file
       
       test_file: unlabeled data for getting prediction
       colname: column name for feeding as input string
       test_unlabel_on: on (1) and off (0) for using unlabeled test data
       proba_file: name of output file of probability
       result_file: name of output file of evaluation
       n_class: number of label class
       
    """

    # Check testing time
    start_time = timeit.default_timer()
    
    # Load downloaded tokenizer & model
    tokenizer = AutoTokenizer.from_pretrained(model_download_path)
    model_loaded = AutoModel.from_pretrained(model_download_path)

    ## 1. Load test data     
    print("\n************** Loading Unlabeled Test Data **************", file=record)
    print("\n************** Loading Unlabeled Test Data **************")
    df_unlabel = load_data(test_file, colname, record=record)
    test_data_loader = create_data_loader(df_unlabel, tokenizer, max_len, batch_size)
    
    ## 2. Load the best trained model with checkpoint 
    print("\n************** Loading Best Trained Model **************", file=record)
    print("\n************** Loading Best Trained Model **************")
    print("\nModel Name: " + modelname, file=record)
    print("\nModel Name: " + modelname)

    model = LabelClassifier(n_class, model_loaded)
    # map_location lets a checkpoint saved on GPU load on a CPU-only machine
    model.load_state_dict(torch.load(model_file_path, map_location=device))
    model = model.to(device)

    # for checking only: model's state_dict 
    #print("\nLoaded Model's state_dict:\n")
    #for param_tensor in model.state_dict():
    #  print(param_tensor, "\t", model.state_dict()[param_tensor].size())

    ## 3. Prediction  
    print("\n************** Getting Predictions **************", file=record)
    print("\n************** Getting Predictions **************")
    y_text, y_pred, y_pred_probs, y_test = get_predictions(model, test_data_loader) 
    
    ## 4. Evaluating model performance      
    print("\n************** No Evaluation Conducted **************", file=record)
    print("\n************** No Evaluation Conducted **************")
        
    ## 5. Probability prediction   
    print("\n************** Creating Probability File **************", file=record)
    print("\n************** Creating Probability File **************") 
    df_test=df_unlabel
    predict_proba(df_test, y_text, y_test, y_pred, y_pred_probs, n_class, proba_file=proba_file, test_unlabel_on=test_unlabel_on, proba_out=proba_on)
    

    print("\nOutput file: '" + result_file + "' Created", file=record)
    print("\nOutput file: '" + result_file + "' Created")

    elapsed = timeit.default_timer() - start_time
    print("\nTesting Time: {}".format(round(elapsed, 2)), file=record)
    print("\nTesting Time: {}".format(round(elapsed,2)))
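
The functions above print every status message twice, once to the console and once to the record file. A minimal sketch of how the pair could be collapsed into one call is shown below. `log_both` is a hypothetical helper that is not part of this notebook, and an in-memory buffer stands in for the open result file.

```python
import io
import sys


def log_both(message, record, console=None):
    """Write one message to the console and to the record file.

    `log_both` is a hypothetical helper, not part of the notebook.
    """
    console = sys.stdout if console is None else console
    print(message, file=console)
    print(message, file=record)


# Usage with an in-memory buffer standing in for the open result file
record = io.StringIO()
log_both("\n************** Getting Predictions **************", record)
log_both("\nTesting Time: {}".format(round(12.3456, 2)), record)
```

With such a helper, each `print(..., file=record)` / `print(...)` pair would become a single line, which also keeps the two messages from drifting apart.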

In [None]:
def main(input_file, test_file, colname, sample_on, sample_type, max_len, 
         batch_size, modelname, model_download_path, device, n_class,
         learning_rate, epochs, model_file_path, test_unlabel_on, eval_on, 
         proba_on, proba_file, result_file):
    
    ## 0. Open the result file for records (closed automatically, even on error)
    with open(result_file, "a") as f:

        if test_unlabel_on == 0:
            train_test_labeled(input_file, colname, sample_on, sample_type, max_len,
                               batch_size, modelname, model_download_path, device,
                               learning_rate, epochs, model_file_path, eval_on,
                               proba_on, proba_file, result_file, record=f)

        elif test_unlabel_on == 1:
            test_unlabeled(test_file, colname, max_len, batch_size, device, n_class,
                           model_download_path, modelname, model_file_path, proba_on,
                           test_unlabel_on, result_file, proba_file, record=f)
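
`main` chooses the pipeline from `test_unlabel_on` and does nothing when the value is neither 0 nor 1. The sketch below shows one way such a flag could be validated before anything runs. `select_mode` is a hypothetical helper, and it returns function names as strings only so that the example stays self-contained.

```python
def select_mode(test_unlabel_on):
    """Return the name of the pipeline function that `main` would call."""
    modes = {0: "train_test_labeled", 1: "test_unlabeled"}
    if test_unlabel_on not in modes:
        raise ValueError(
            "test_unlabel_on must be 0 or 1, got {!r}".format(test_unlabel_on)
        )
    return modes[test_unlabel_on]


print(select_mode(0))
print(select_mode(1))
```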

# 7. Main Code: Set Parameters & Run

In [None]:
%%time

if __name__== "__main__":
    
    #####################################################################################    
    ############################### Set Parameter Values ################################
    #####################################################################################

    ###### 1. Input file, column name, and whether unlabeled test data exists
    input_filename = "multi_labeled.csv"  
    column_name = "text"                                         # 'text'; 'category'; 'mix' for text+category
    
    test_unlabel_on = 0                                          # 1 if unlabeled data exists for test; otherwise 0
    
    test_filename = "multi_unlabeled.csv"                        # filename if test_unlabel_on=1; otherwise None
    num_class = 5                                                # only needed when test_unlabel_on=1
       
    ###### 2. Sampling applied?
    sampling_on = 0                                             # 0 for no sampling; 1 for sampling
    sampling_type = 'under'                                      # Use when sampling_on=1; 'over'(oversampling), 'under'(undersampling)

    ###### 3. Which pretrained model to use?

    ###### 3-1. Model name
    pretrained_model_name = 'bert-base-cased'
    #pretrained_model_name = 'roberta-base'
    #pretrained_model_name = 'vinai/bertweet-base'

    ###### 3-2. Download pretrained model?
    internet_on = 0                                              # 1 for downloading via internet connection; 0 for loading locally   
    
    modelname_string = pretrained_model_name.split("/")[-1] 
    pretrained_model_folder = "./model_" + modelname_string

    ###### 4. Hyperparameters 
    MAX_LEN = 100                                                # 150 for title; 512 for abs (Maximum input size: 512 (BERT))
    BATCH_SIZE = 16                                              # Batch size: 16 or 32
    EPOCHS = 4                                                   # Number of epochs: 2,3,4
    LEARNING_RATE = 5e-5                                         # Learning rate:5e-5, 3e-5, 2e-5

    ###### 5. Evaluation & probability file
    eval_on=1                                                    # 0 for no; 1 for yes (display confusion matrix/classification report)
    proba_on=1                                                   # 0 for no; 1 for yes (probability output)     

    ###### 6. Output filename suffix
    if test_unlabel_on==1:
      label_name = "unlabeled"
    else:
      label_name= "labeled"


    #####################################################################################
    ################################# Run Main Function #################################
    #####################################################################################

    if internet_on:    
      
      # download files of pretrained tokenizer & model using Huggingface API
      tokenizer_download = AutoTokenizer.from_pretrained(pretrained_model_name)  
      model_download = AutoModel.from_pretrained(pretrained_model_name)

      # save to local folder
      os.makedirs(pretrained_model_folder, exist_ok = True)
      tokenizer_download.save_pretrained(pretrained_model_folder)    
      model_download.save_pretrained(pretrained_model_folder)

    else:    

      if sampling_on:
        proba_file = "result_all_" + modelname_string + "_" + sampling_type + "_" + column_name + "_" + label_name + ".csv"  
        eval_file = "eval_all_" + modelname_string + "_" + sampling_type + "_" + column_name + "_" + label_name + ".txt"
        best_model_file = pretrained_model_folder + "/bestmodel_" + modelname_string + "_" + sampling_type + "_" + column_name + ".pt"
      else:
        proba_file = "result_all_" + modelname_string + "_" + column_name + "_" + label_name + ".csv"  
        eval_file = "eval_all_" + modelname_string + "_" + column_name + "_" + label_name + ".txt" 
        best_model_file = pretrained_model_folder + "/bestmodel_" + modelname_string + "_" + column_name + ".pt"
            
      main(input_file=input_filename,
           test_file=test_filename, 
           colname=column_name,
           n_class=num_class,
           sample_on=sampling_on, 
           sample_type=sampling_type,
           max_len=MAX_LEN, 
           batch_size=BATCH_SIZE,
           modelname=modelname_string,
           device=device,
           model_download_path=pretrained_model_folder,
           learning_rate=LEARNING_RATE,
           epochs=EPOCHS,
           model_file_path=best_model_file, 
           test_unlabel_on=test_unlabel_on,
           eval_on=eval_on, 
           proba_file=proba_file,
           proba_on=proba_on,
           result_file=eval_file)
        
    print("\n************** Processing Completed **************\n")
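
The output file names in the cell above are assembled from the model name, the sampling type (only when sampling is on), the column name, and the labeled/unlabeled suffix. The sketch below rebuilds that naming rule in one place so it can be checked without running the pipeline. `build_output_names` is a hypothetical helper, not a function of this notebook.

```python
def build_output_names(pretrained_model_name, column_name, test_unlabel_on,
                       sampling_on=0, sampling_type="under"):
    """Rebuild the file names used in the cell above (hypothetical helper)."""
    modelname_string = pretrained_model_name.split("/")[-1]
    model_folder = "./model_" + modelname_string
    label_name = "unlabeled" if test_unlabel_on == 1 else "labeled"

    # The sampling type is part of the name only when sampling is on
    parts = [modelname_string]
    if sampling_on:
        parts.append(sampling_type)
    parts.append(column_name)
    stem = "_".join(parts)

    return {
        "proba_file": "result_all_" + stem + "_" + label_name + ".csv",
        "eval_file": "eval_all_" + stem + "_" + label_name + ".txt",
        "best_model_file": model_folder + "/bestmodel_" + stem + ".pt",
    }


names = build_output_names("vinai/bertweet-base", "text", test_unlabel_on=1)
for key, value in names.items():
    print(key, "->", value)
```

Note that the best-model file name carries no labeled/unlabeled suffix, so a run on unlabeled data picks up the checkpoint saved by an earlier labeled run with the same model, sampling, and column settings.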

### 1. internet_on=1
### Download the pretrained models -> save as local files

### 2. internet_on=0
### All downloaded models are loaded from subfolders of the current directory; each folder name includes the model name

### ====================================================================================

### 3. test_unlabel_on=0
###    input_filename = ".csv" (labeled data) => fine-tuning (train the BERT model with the training data => save the best model)

### 4. test_unlabel_on=1
###    test_filename = ".csv" (unlabeled data, used for testing only with the best model)
### final output: summary file 2, proba file 2
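
The settings above imply an order: download the pretrained model first, then train on labeled data, then predict on unlabeled data with the saved best model. A minimal sketch of a check that could run before a local (`internet_on=0`) run is shown below. `missing_prerequisites` is a hypothetical helper, and the folder and checkpoint names in the demo are only examples of the naming pattern used in this notebook.

```python
import os
import tempfile


def missing_prerequisites(model_folder, best_model_file, test_unlabel_on):
    """List what a run with internet_on=0 would need but cannot find."""
    missing = []
    if not os.path.isdir(model_folder):
        missing.append("pretrained model folder: " + model_folder)
    # A labeled run creates the checkpoint itself;
    # an unlabeled run needs it to exist already
    if test_unlabel_on == 1 and not os.path.isfile(best_model_file):
        missing.append("best model checkpoint: " + best_model_file)
    return missing


# Demo in a throwaway directory: the folder exists, the checkpoint does not
with tempfile.TemporaryDirectory() as tmp:
    folder = os.path.join(tmp, "model_bert-base-cased")
    os.makedirs(folder)
    checkpoint = os.path.join(folder, "bestmodel_bert-base-cased_text.pt")
    demo_train = missing_prerequisites(folder, checkpoint, test_unlabel_on=0)
    demo_predict = missing_prerequisites(folder, checkpoint, test_unlabel_on=1)

print(demo_train)
print(len(demo_predict))
```

An empty list means the run can start; otherwise each entry names the step that still has to be done first.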

