Here is our final notebook for the ESG Document Classifier task. We tried several models (SVM, XGBoost, LDA Classifier, and BERT) before settling on BERT.

Since BERT is a large pretrained language model, we knew we needed lots of data to get the model to converge and generalize, so we used a variety of data augmentation techniques. For example, we randomly sampled paragraphs from the training data and asked several chat-based LLMs (GPT4ALL and HuggingFace Chatbot) to generate similar samples. Our thinking was that this could teach the model general contexts for jargon-specific words, rather than having it memorize single sentences. However, it also runs the risk of overfitting the model to the training data (our model is likely overfit to some degree). To combat the overfitting, we looked at the results of our LDA analysis of the training data and asked the same chatbots to generate paragraphs around the general topics and word combinations that LDA produced (emission reduction, gender diversity in the workplace, corporate code of conduct, etc.). We did this at multiple scales (page-length, paragraph-length, sentence-length), and we ultimately arrived at 3 addendum files based on the training data--one each for environmental, social, and governance. These were then cleaned (lowercased, stripped of numbers, etc.).

The notebook below shows the model we used with this data to make our submission predictions. We tried a few varieties of BERT (base-cased, base-uncased, and large-cased), but base-uncased gave us the best accuracy. We also ran hyperparameter sweeps to point us toward parameters that would maximize our validation F1 score (validation data was held out from the training set) while converging reasonably smoothly. Last, we performed EDA on the model's results to see where it was going wrong with the training data, and what solutions might exist. For example, when we fed the model each text sample from the beginning, the model performed poorly and overfit the training data (when compared to validation data). This happened because many of the data samples begin with a similar header, and since our model has a limited context window (it can only see the first 512 tokens of any sample), it was missing most of the unique identifying information in each sample. We moved the context window towards the middle of each sample by skipping the first 500 characters of each OxML sample (700 for the generated samples), and the model performed much better.
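The notebook shifts the window by dropping a fixed number of leading characters before tokenizing. As a rough illustration of the same idea at the token level, the sketch below keeps a window taken from the middle of a sequence. `middle_window` and its `size` parameter are hypothetical names that do not appear in the notebook, and a real BERT input would still need the special `[CLS]` and `[SEP]` tokens added around the window.

```python
def middle_window(tokens, size=512):
    """Return at most `size` items taken from the middle of `tokens`."""
    if len(tokens) <= size:
        # Short sequences fit in the window as they are
        return list(tokens)
    # Centre the window so the leading header and the tail are both skipped
    start = (len(tokens) - size) // 2
    return list(tokens[start:start + size])


# Toy example: ten "tokens", window of four
window = middle_window(list(range(10)), size=4)
```

A centred window adapts to the length of each sample, whereas a fixed character offset is simpler and is what the cleaning cells in this notebook use.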



---


The code is based on https://towardsdatascience.com/text-classification-with-bert-in-pytorch-887965e5820f and https://www.tensorflow.org/text/tutorials/classify_text_with_bert#define_your_model. However, we made several changes--for example, we added an extra linear layer and a log-softmax output in the classifier module, and we changed the optimizer to AdamW for better generalization (in theory).
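One detail about the log-softmax output: the training function uses `nn.CrossEntropyLoss`, which already applies a log-softmax to its input. Because log-softmax is idempotent, feeding it log-probabilities instead of raw logits gives the same loss, so the extra layer is redundant rather than harmful. The dependency-free sketch below checks this numerically; the helper names and the example logits are made up for illustration.

```python
import math


def log_softmax(xs):
    """Numerically stable log-softmax of a list of floats."""
    m = max(xs)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_sum_exp for x in xs]


def cross_entropy(logits, target):
    """Cross-entropy for one sample: negative log-softmax at the target index."""
    return -log_softmax(logits)[target]


logits = [2.0, -1.0, 0.5, 0.0]   # hypothetical scores for the 4 classes
once = log_softmax(logits)       # what the classifier head outputs
twice = log_softmax(once)        # what the loss computes internally

loss_from_logits = cross_entropy(logits, target=2)
loss_from_log_probs = cross_entropy(once, target=2)
```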



---





Path to saved model: https://drive.google.com/file/d/1RLuEuT_Pdbhzof6hCx7qbIPS2b5v9tU-/view?usp=share_link
Path to saved parameters: https://drive.google.com/file/d/1-0emYyGl0EkTh1epwzbRPI8cKqmiDHGb/view?usp=share_link

# Install Dependencies

In [None]:
!pip install transformers

from transformers import BertTokenizer
import torch
from torch import nn
from transformers import BertModel

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
# Turn texts to strings of tokens
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


# Build A Dataset: OxML Data and Generated Data

Load the original data and clean. Then, load the data we generated from other transformers and clean. Finally, split the data into train and validation sets and build a DataLoader module that will feed batches to the model.
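The same cleaning steps are repeated for every data source in the cells below. Here is a minimal sketch of how they could be collected into one helper, assuming the behaviour of the original lines is what is wanted. `clean_text` and `skip_chars` are hypothetical names that do not appear in the notebook. The second hyphen replacement (`'- '`) is left out because it can never match once all hyphens have been removed.

```python
import re


def clean_text(text, skip_chars=0):
    """Apply the notebook's cleaning steps to a single string."""
    # Flatten line breaks and blank out symbols that carry little meaning here
    text = text.replace('\n', ' ')
    for symbol in ('%', '$', '.'):
        text = text.replace(symbol, ' ')
    # Lowercase, then strip digits and hyphens
    text = text.lower()
    text = re.sub(r'[0-9]', '', text)
    text = text.replace('-', '')
    # Skip the leading characters (header) only when the text is long enough
    return text[skip_chars:] if len(text) > skip_chars else text


example = clean_text("Net-Zero 2050")
```

With such a helper, each source would reduce to something like `texts = [clean_text(t, skip_chars=500) for t in texts]`, using 500 for the OxML texts and 700 for the generated ones.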

In [None]:
import re
pattern = r'[0-9]'

In [None]:
# Get OxML texts
my_file = open('/content/drive/MyDrive/oxml2023mlcases-esg-classifier/oxml_esg_texts.txt', "r")
data = my_file.read()
texts = data.split("unique_linebreak \n")
my_file.close()

# Get ESG labels
df = pd.read_csv('/content/drive/MyDrive/oxml2023mlcases-esg-classifier/data/labels.csv')

labels = {
    'governance': 0,
    'social': 1,
    'environmental': 2,
    'other': 3
}

# Add text column
texts = [text.replace('\n', ' ') for text in texts]
texts = [t.replace('%', ' ') for t in texts]
texts = [t.replace('$', ' ') for t in texts]
texts = [t.replace('.', ' ') for t in texts]
texts = [text.lower() for text in texts]
texts = [re.sub(pattern, '', t) for t in texts]
texts = [t.replace('-', '') for t in texts]
texts = [t.replace('- ', '') for t in texts]
texts = [t[500:] if len(t) > 500 else t for t in texts]

# Add text column to label dataframe
df['text'] = texts[:-1]

In [None]:
len(df)

1956

In [None]:
# Get extra texts
my_file = open('/content/governance_trianingGPT3.txt', "r")
data = my_file.read()
texts = data.split("unique_linebreak")
my_file.close()

texts = [t.replace('\n', ' ') for t in texts]
texts = [t.replace('%', ' ') for t in texts]
texts = [t.replace('$', ' ') for t in texts]
texts = [t.replace('.', ' ') for t in texts]
texts = [t.lower() for t in texts]
texts = [re.sub(pattern, '', t) for t in texts]
texts = [t.replace('-', '') for t in texts]
texts = [t.replace('- ', '') for t in texts]
texts = [t[700:] if len(t) > 700 else t for t in texts]

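# Label every chunk except the last one (the remainder after the final separator)
extra = pd.DataFrame({'class': 'governance', 'text': texts[:-1]})

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, extra], ignore_index=True)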

# -----------------------------------------------------
# Get extra texts
my_file = open('/content/social_trianingGPT3.txt', "r")
data = my_file.read()
texts = data.split("unique_linebreak")
my_file.close()

texts = [t.replace('\n', ' ') for t in texts]
texts = [t.replace('%', ' ') for t in texts]
texts = [t.replace('$', ' ') for t in texts]
texts = [t.replace('.', ' ') for t in texts]
texts = [t.lower() for t in texts]
texts = [re.sub(pattern, '', t) for t in texts]
texts = [t.replace('-', '') for t in texts]
texts = [t.replace('- ', '') for t in texts]
texts = [t[700:] if len(t) > 700 else t for t in texts]

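# Label every chunk except the last one (the remainder after the final separator)
extra = pd.DataFrame({'class': 'social', 'text': texts[:-1]})

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, extra], ignore_index=True)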

# -----------------------------------------------------
# Get extra texts
my_file = open('/content/environment_trianingGPT4_augmented.txt', "r")
data = my_file.read()
texts = data.split("unique_linebreak")
my_file.close()

texts = [t.replace('\n', ' ') for t in texts]
texts = [t.replace('%', ' ') for t in texts]
texts = [t.replace('$', ' ') for t in texts]
texts = [t.replace('.', ' ') for t in texts]
texts = [t.lower() for t in texts]
texts = [re.sub(pattern, '', t) for t in texts]
texts = [t.replace('-', '') for t in texts]
texts = [t.replace('- ', '') for t in texts]
texts = [t[700:] if len(t) > 700 else t for t in texts]

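# Label every chunk except the last one (the remainder after the final separator)
extra = pd.DataFrame({'class': 'environmental', 'text': texts[:-1]})

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
df = pd.concat([df, extra], ignore_index=True)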

# -----------------------------------------------------
# Get extra texts
extra_extra = pd.read_csv('/content/training_balanced_cleaned.csv')
extra_extra.drop(columns=['Unnamed: 0'], inplace=True)
extra_extra = extra_extra.dropna(subset=['text'])

texts = extra_extra['text']

texts = [t.replace('\n', ' ') for t in texts]
texts = [t.replace('%', ' ') for t in texts]
texts = [t.replace('$', ' ') for t in texts]
texts = [t.replace('.', ' ') for t in texts]
texts = [t.lower() for t in texts]
texts = [re.sub(pattern, '', t) for t in texts]
texts = [t.replace('-', '') for t in texts]
texts = [t.replace('- ', '') for t in texts]
texts = [t[700:] if len(t) > 700 else t for t in texts]

extra_extra['text'] = texts

df = pd.concat([df, extra_extra], ignore_index=True)

In [None]:
# Split the data into train (70%) and validation (30%) sets
df_train, df_val = train_test_split(df, test_size=0.3, shuffle=True)

print(len(df_train), len(df_val))

In [None]:
# max_length=512 tokens is the longest input bert-base-uncased accepts

class Dataset(torch.utils.data.Dataset):

    def __init__(self, df):

        self.labels = [labels[l] for l in df['class']]
        self.texts = [tokenizer(text,
                               padding='max_length', max_length = 512, truncation=True,
                                return_tensors="pt") for text in df['text']]

    def classes(self):
        return self.labels

    def __len__(self):
        return len(self.labels)

    def get_batch_labels(self, idx):
        # Fetch the label for one sample (the DataLoader does the batching)
        return np.array(self.labels[idx])

    def get_batch_texts(self, idx):
        # Fetch the tokenized input for one sample
        return self.texts[idx]

    def __getitem__(self, idx):

        batch_texts = self.get_batch_texts(idx)
        batch_y = self.get_batch_labels(idx)

        return batch_texts, batch_y # -> one tokenized text and its label

# Build A BERT Classification Model
Fit a fully connected (FC) classifier head on top of BERT. It takes BERT's pooled class-token ([CLS]) embedding for each sequence and passes it through the classifier layers.

In [None]:
# Original classifier idea
class BertClassifier(nn.Module):
  def __init__(self, dropout=0.5):

    super(BertClassifier, self).__init__()

    self.bert = BertModel.from_pretrained('bert-base-uncased')
    self.dropout = nn.Dropout(dropout)
    self.relu = nn.ReLU()
    self.fc1 = nn.Linear(768, 512) # -> input is the pooled 768-dim class embedding from BERT, output is a 512-dim hidden vector
    self.fc2 = nn.Linear(512, 4) # -> maps the hidden vector to scores for the 4 classes
    self.softmax = nn.LogSoftmax(dim=1)

  def forward(self, input_id, mask):
    # _ contains the embedding vectors for all tokens in a sequence; out is the pooled output, i.e. the class-token embedding after BERT's pooler (linear layer + tanh)
    _, out = self.bert(input_ids=input_id, attention_mask=mask, return_dict=False)

    # the embedding vector for the class token gets passed through these layers for classification
    out = self.dropout(out)
    out = self.relu(out)
    out = self.fc1(out)
    out = self.fc2(out)
    out = self.softmax(out)

    return out

# Train the BERT Classification Model

In [None]:
from torch.optim import AdamW
from tqdm import tqdm
from sklearn.metrics import f1_score

In [None]:
def get_f1(labels, preds):
  avg = []

  for i, j in zip(labels, preds):
    f1 = f1_score(i, j, labels=np.unique(j), average='macro')
    avg.append(f1)

  total_f1 = sum(avg)/len(avg)

  return total_f1
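`get_f1` scores each batch separately and then averages the per-batch values. With a batch size of 4, most batches contain only one or two classes, so the result can differ noticeably from the macro F1 of the whole epoch. As a hedged alternative (not what was used for the submission), the sketch below pools all batches first and computes macro F1 once. It is written without dependencies so it runs on its own; `macro_f1` and `pooled_macro_f1` are hypothetical names, and the toy batches are made up.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in labels or predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def pooled_macro_f1(label_batches, pred_batches):
    """Flatten the per-batch lists collected during an epoch, then score once."""
    y_true = [int(y) for batch in label_batches for y in batch]
    y_pred = [int(y) for batch in pred_batches for y in batch]
    return macro_f1(y_true, y_pred)


# Two toy batches of two samples each
epoch_f1 = pooled_macro_f1([[0, 1], [2, 2]], [[0, 2], [2, 2]])
```

With scikit-learn available, `f1_score(y_true, y_pred, average='macro')` on the flattened lists should give the same number.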

In [None]:
def train(model, train_data, val_data, learning_rate, epochs):

  # set up datasets
  train, val = Dataset(train_data), Dataset(val_data)

  # load the datasets
  train_dataloader = torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=True)
  val_dataloader = torch.utils.data.DataLoader(val, batch_size=batch_size)

  # try for a GPU
  use_cuda = torch.cuda.is_available()
  device = torch.device("cuda" if use_cuda else "cpu")

  # define loss and optimizer
  loss = nn.CrossEntropyLoss()
  optimizer = AdamW(model.parameters(), lr=learning_rate)

  avg_train_f1 = []
  avg_val_f1 = []

  if use_cuda:
    model = model.cuda()
    loss = loss.cuda()

  for epoch in range(epochs):

    train_outputs = []
    train_labels = []
    val_outputs = []
    val_labels = []

    train_acc = 0
    train_loss = 0

    for train_input, train_label in tqdm(train_dataloader):

      # move the labels, attention mask, and input ids to the device (GPU if available)
      train_label = train_label.to(device)
      train_labels.append(train_label.cpu().numpy())
      mask = train_input['attention_mask'].to(device)
      input_id = train_input['input_ids'].squeeze(1).to(device)

      # feed data to model
      output = model(input_id, mask)
      train_outputs.append(output.argmax(dim=1).cpu().numpy())

      # calculate loss
      batch_loss = loss(output, train_label.long())
      train_loss += batch_loss.item()

      # calculate accuracy -> likeliest label correct?
      acc = (output.argmax(dim=1) == train_label).sum().item()
      train_acc += acc

      model.zero_grad()
      batch_loss.backward()
      optimizer.step()

    val_acc = 0
    val_loss = 0

    # validation: switch to eval mode (disables dropout) and skip gradient tracking
    model.eval()
    with torch.no_grad():

      for val_input, val_label in val_dataloader:

        val_label = val_label.to(device)
        val_labels.append(val_label.cpu().numpy())
        mask = val_input['attention_mask'].to(device)
        input_id = val_input['input_ids'].squeeze(1).to(device)

        output = model(input_id, mask)
        val_outputs.append(output.argmax(dim=1).cpu().numpy())

        batch_loss = loss(output, val_label.long())
        val_loss += batch_loss.item()

        acc = (output.argmax(dim=1) == val_label).sum().item()
        val_acc += acc

    # back to training mode (dropout on) for the next epoch
    model.train()

    avg_train_f1.append(get_f1(train_labels, train_outputs))
    avg_val_f1.append(get_f1(val_labels, val_outputs))

    # each batch loss is already a mean over its samples, so average over the number of batches
    print(
    f'Epochs: {epoch + 1} | Train Loss: {train_loss / len(train_dataloader): .3f} \
    | Train Accuracy: {train_acc / len(train_data): .3f} \
    | Train F1: {avg_train_f1[-1]: .3f} \
    | Val Loss: {val_loss / len(val_dataloader): .3f} \
    | Val Accuracy: {val_acc / len(val_data): .3f} \
    | Val F1: {avg_val_f1[-1]: .3f}')

We've saved the model and parameters from the run we used to generate our submission. To reproduce it, download the files from the Google Drive links given at the top of the notebook and point the paths below at them.

In [None]:
# Load previously saved model
PATH = '/content/drive/MyDrive/oxml2023mlcases-esg-classifier/esg_transformer_61823.pth'
model = torch.load(PATH)

#model.eval() # -> sets the model to evaluation mode for testing (turn on if just wanting to check the model against test data)

#Load previously saved model parameters
PATH = '/content/drive/MyDrive/oxml2023mlcases-esg-classifier/esg_transformer_61823_params.pth'

model.load_state_dict(torch.load(PATH))

<All keys matched successfully>

In [None]:
#model = BertClassifier() -> this line is used when training from scratch
model = model # -> this line loads the pretrained model from the previous cell for evaluation or further training

epochs = 5
batch_size = 4
learning_rate = 1e-5

train(model, df_train, df_val, learning_rate, epochs)

# Save A Trained Model
We used this cell to save the model and parameters from our successful runs.

In [None]:
#Save a model
PATH = '/content/drive/MyDrive/oxml2023mlcases-esg-classifier/esg_transformer_62223.pth'
torch.save(model, PATH)

#Save a model's parameters
PATH = '/content/drive/MyDrive/oxml2023mlcases-esg-classifier/esg_transformer_62223_params.pth'
torch.save(model.state_dict(), PATH)

# Predict on Sample_Submission.csv

In [None]:
!pip install PyMuPDF

from pathlib import Path
import re
import fitz
import pandas as pd
from PIL import Image
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# directories & files
DIR_DATA = Path("/content/drive/MyDrive/oxml2023mlcases-esg-classifier/data/")
REPORTS_DIR = "reports/"
LABELS_FILE = "labels.csv"

# columns
C_ID, C_CLASS = "id", "class"

In [None]:
submission = pd.read_csv("/content/drive/MyDrive/oxml2023mlcases-esg-classifier/sample_submission.csv")

In [None]:
def create_filepath(filename):
    return DIR_DATA / REPORTS_DIR / filename

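def read_page(filename, page_number):
    filepath = create_filepath(filename)
    # page numbers in the ids are 1-based, PyMuPDF page indices are 0-based
    page_index = page_number - 1
    # open the PDF in a context manager so the file is closed after reading
    with fitz.open(filepath) as doc:
        return doc.load_page(page_index).get_text()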

In [None]:
# clean the test data so it matches the train data

t_texts = []

for i in range(len(submission)):
#     print(i)
    path = submission.iloc[i][C_ID]
#     print(path)
    matches = re.match(r'^(.+)\.(\d+)$', path)
    filename = matches.group(1)
    page_number = int(matches.group(2))

    content = read_page(filename, page_number)
    t_texts.append(content)

t_texts = [t.replace('\n', ' ') for t in t_texts]
t_texts = [t.replace('%', ' ') for t in t_texts]
t_texts = [t.replace('$', ' ') for t in t_texts]
t_texts = [t.replace('.', ' ') for t in t_texts]
t_texts = [t.lower() for t in t_texts]
t_texts = [re.sub(pattern, '', t) for t in t_texts]
t_texts = [t.replace('-', '') for t in t_texts]
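t_texts = [t.replace('- ', '') for t in t_texts]
# skip the first 500 characters, as was done for the OxML training texts
t_texts = [t[500:] if len(t) > 500 else t for t in t_texts]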

texts_df = pd.DataFrame(t_texts, columns=['text'])
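The loop above assumes every id has the form `<filename>.<page number>`. If one does not, `re.match` returns `None` and the `.group` call fails with an `AttributeError` that does not say which id was at fault. Below is a small sketch of the same parsing with an explicit error; `parse_page_id` is a hypothetical helper name and the example id is made up.

```python
import re

# Same pattern as in the loop: everything up to the last dot, then the page number
ID_PATTERN = re.compile(r'^(.+)\.(\d+)$')


def parse_page_id(page_id):
    """Split an id such as 'report.pdf.7' into ('report.pdf', 7)."""
    match = ID_PATTERN.match(page_id)
    if match is None:
        raise ValueError(f"unexpected id format: {page_id!r}")
    return match.group(1), int(match.group(2))


filename, page_number = parse_page_id("report_12.pdf.7")
```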

In [None]:
# Define an evaluation function

def evaluate(model, df):
  test = [tokenizer(text, padding='max_length', max_length = 512, truncation=True,
                                return_tensors="pt") for text in df['text']]
  pred_list = []

  # pick the device, as in the training function
  use_cuda = torch.cuda.is_available()
  device = torch.device("cuda" if use_cuda else "cpu")

  if use_cuda:
    model = model.cuda()

  # evaluation mode disables dropout so predictions are deterministic
  model.eval()

  with torch.no_grad():

    for test_input in test:

      mask = test_input['attention_mask'].to(device)
      input_id = test_input['input_ids'].squeeze(1).to(device)

      output = model(input_id, mask)

      pred = output.argmax()

      pred_list.append(int(pred.detach()))

  return pred_list

In [None]:
# Generate predictions
preds = evaluate(model, texts_df)

In [None]:
# Add predictions to dataframe, in word categories

submission['nums'] = preds

label_mapping = {0: 'governance', 1: 'social', 2: 'environmental', 3: 'other'}
submission['nums'] = submission['nums'].map(label_mapping)

submission.drop(columns=['class'], inplace=True)
submission.rename(columns={'nums': 'class'}, inplace=True)
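`label_mapping` is the inverse of the `labels` dictionary defined when the dataset was built. One way to keep the two in sync, sketched here with the dictionary copied from the earlier cell and a made-up list of predictions, is to derive one from the other:

```python
# Class-name -> index mapping, copied from the dataset cell
labels = {'governance': 0, 'social': 1, 'environmental': 2, 'other': 3}

# Invert it so predicted indices can be turned back into class names
label_mapping = {index: name for name, index in labels.items()}

preds = [2, 0, 3]  # hypothetical model outputs
pred_names = [label_mapping[p] for p in preds]
```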

In [None]:
# Convert submission to CSV
submission.to_csv('submission_621.csv', index=False)