## Technical Model Evaluation for Pre-Trained Model

### Import Necessary Libraries

In [3]:
from sklearn.metrics import accuracy_score
import numpy as np
import transformers

# BERT is the baseline model we are comparing to
from transformers import BertTokenizer, BertModel

# Used for importing our pre-trained model
from transformers import ElectraTokenizerFast
from transformers import ElectraConfig
from transformers import ElectraModel
from transformers import ElectraForMaskedLM

import torch

# Can use with NVIDIA CUDA
from torch import cuda
from tqdm import tqdm

# Model was fine-tuned and evaluated using Mac's Metal 3 GPU
if torch.backends.mps.is_available():
    device = torch.device("mps")
    x = torch.ones(1, device=device)
    print(x)
else:
    # Fall back so that `device` is always defined
    device = torch.device("cuda" if cuda.is_available() else "cpu")
    print("MPS device not found.")

tensor([1.], device='mps:0')


### Preprocess Data

In [4]:
# Read in the data used to fine-tune and evaluate the model
with open('X.txt') as f:
    X = [line.strip() for line in f]
with open('YL1.txt') as f:
    y = [int(line.strip()) for line in f]

len(X), len(y), max(y)

(46985, 46985, 6)

We manually split the data into train and test sets by index. We also carve out a 10% sample so the pipeline can be edited and debugged without running through the full dataset.

In [5]:
# Fixed constant for the size of the training data
full_data = 46000
split_percent = .1

# Take 10% of the full data
sample_data = int(full_data * split_percent)

# Index at which the sample is split into train and test (about 2.4% held out)
sample_split = (sample_data - int(sample_data * .024))

# Take a sample of the original data
sample_X = X[:sample_data]
sample_y = y[:sample_data]

# Full train/test
train_X = X[:full_data]
train_y = np.array(y[:full_data])
test_X = X[full_data:]
test_y = np.array(y[full_data:])

# Sample train/test
sample_train_X = sample_X[:sample_split]
sample_train_y = np.array(sample_y[:sample_split])
sample_test_X = sample_X[sample_split:]
sample_test_y = np.array(sample_y[sample_split:])

# Only the last expression is displayed by the notebook
len(train_X), len(train_y), len(test_X), len(test_y)
len(sample_train_X), len(sample_train_y), len(sample_test_X), len(sample_test_y)

(4490, 4490, 110, 110)
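
The sample sizes shown above follow directly from the constants in the cell. As a quick check, the arithmetic can be reproduced on its own. The helper name `split_sizes` is hypothetical and not part of the notebook.

```python
def split_sizes(full_data, split_percent, test_fraction):
    """Return (train_size, test_size) for the sampled subset."""
    # Size of the sample taken from the full dataset
    sample_data = int(full_data * split_percent)
    # Number of sampled examples held out for testing
    test_size = int(sample_data * test_fraction)
    return sample_data - test_size, test_size


train_size, test_size = split_sizes(46000, 0.1, 0.024)
print(train_size, test_size)  # 4490 110
```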

Map each integer label to the document category it represents.

In [4]:
# not needed for training or evaluation, but useful for mapping examples
labels = {
    0:'Computer Science',
    1:'Electrical Engineering',
    2:'Psychology',
    3:'Mechanical Engineering',
    4:'Civil Engineering',
    5:'Medical Science',
    6:'Biochemistry'
}

len(labels)

7
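
As a small illustration of how this mapping can be used, the sketch below turns a vector of class scores into a category name. The helper `predicted_label` and the example scores are made up for illustration.

```python
labels = {
    0: 'Computer Science',
    1: 'Electrical Engineering',
    2: 'Psychology',
    3: 'Mechanical Engineering',
    4: 'Civil Engineering',
    5: 'Medical Science',
    6: 'Biochemistry',
}


def predicted_label(scores, label_map):
    """Return the category name of the highest-scoring class."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return label_map[best_index]


# Hypothetical scores for one document, one value per class
scores = [0.05, 0.10, 0.60, 0.05, 0.05, 0.10, 0.05]
print(predicted_label(scores, labels))  # Psychology
```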

### Fine-tune BERT on the dataset

#### Label Conversion

This class implements a custom PyTorch dataset for text classification. Although it is named `MultiLabelDataset`, each text example here carries a single integer target from 0 to 6, so the task is multi-class (one category per document) rather than multi-label (several categories per document at once).

The core purpose is to prepare text data for input into our models by:

- Converting raw text into tokenized numerical representations
- Pairing each text example with its target label
- Formatting everything in a way PyTorch can efficiently process during training

In [5]:
class MultiLabelDataset(torch.utils.data.Dataset):

    def __init__(self, text, labels, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.text = text
        self.targets = labels
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = self.text[index]
        # Tokenize, then pad or truncate every example to exactly max_len tokens
        inputs = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            truncation=True,
            return_token_type_ids=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            # self.targets is already a tensor, so avoid re-wrapping it with torch.tensor
            'targets': torch.as_tensor(self.targets[index], dtype=torch.long)
        }
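
To make the padding and attention mask concrete, here is a minimal framework-free sketch of what fixing every example to `max_len` tokens means. The function `pad_or_truncate` and the token ids are illustrative only. A real tokenizer also keeps the closing special token when it truncates, which this simplification does not.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (ids, mask), both exactly max_len long."""
    # Cut off anything beyond max_len
    ids = token_ids[:max_len]
    # 1 marks a real token, 0 marks padding
    mask = [1] * len(ids)
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


# Illustrative token ids for a short sentence
ids, mask = pad_or_truncate([101, 7592, 2088, 102], max_len=6)
print(ids)   # [101, 7592, 2088, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```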

#### Adding pre-trained BERT model

The `BERTClass` wraps the pre-trained `bert-base-uncased` model from Hugging Face and adds a linear classification layer on top. `NUM_OUT` is the number of output classes; in our case this number is 7.

In [59]:
class BERTClass(torch.nn.Module):
    def __init__(self, NUM_OUT):
        super(BERTClass, self).__init__()
                   
        self.l1 = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = torch.nn.Linear(768, NUM_OUT)
        self.softmax = torch.nn.Softmax(dim=1)

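    def forward(self, input_ids, attention_mask, token_type_ids):
        # token_type_ids is accepted for a uniform interface but not used here
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)
        # Last hidden state, shape (batch, seq_len, 768)
        hidden_state = output_1[0]
        # Representation of the first ([CLS]) token
        pooler = hidden_state[:, 0]
        output = self.classifier(pooler)
        # Note: loss_fn uses CrossEntropyLoss, which applies log-softmax itself,
        # so softmax is effectively applied twice during training
        output = self.softmax(output)
        return output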

#### Adding pre-trained ELECTRA model

The `ELECTRAClass` is similar to `BERTClass`, but we load our own pre-trained model from a local directory rather than from Hugging Face. Since we pre-trained with masked-word prediction, the checkpoint includes a masked-language-modeling head. We bypass that head and pass the output of the final encoder layer to a classifier for our evaluation.

In [91]:
class ELECTRAClass(torch.nn.Module):
    def __init__(self, NUM_OUT):
        super(ELECTRAClass, self).__init__()
        # Load the model configuration
        self.config = ElectraConfig.from_json_file("config.json")

        # Build ElectraForMaskedLM to match the pre-trained checkpoint;
        # forward() only uses its inner encoder (self.l1.electra)
        self.l1 = ElectraForMaskedLM(self.config)

        # Load the pre-trained weights
        state_dict = torch.load("babylm_model.bin", map_location=torch.device('cpu'))

        # Load the weights into the configured model
        self.l1.load_state_dict(state_dict)

        self.classifier = torch.nn.Linear(self.config.hidden_size, NUM_OUT)
        self.softmax = torch.nn.Softmax(dim=1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        # Feed the inputs through ELECTRA's encoder layers, bypassing the
        # masked-word prediction head
        output_1 = self.l1.electra(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids
        )

        # Grab the outputs of the last encoder layer
        hidden_state = output_1.last_hidden_state
        # Representation of the first ([CLS]) token
        pooler = hidden_state[:, 0]

        # Pass the vector to the classifier for the final prediction
        output = self.classifier(pooler)
        output = self.softmax(output)
        return output

Model layout for reference:

In [89]:
model = ELECTRAClass(NUM_OUT)
model

ELECTRAClass(
  (l1): ElectraForMaskedLM(
    (electra): ElectraModel(
      (embeddings): ElectraEmbeddings(
        (word_embeddings): Embedding(30522, 64, padding_idx=0)
        (position_embeddings): Embedding(512, 64)
        (token_type_embeddings): Embedding(2, 64)
        (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (embeddings_project): Linear(in_features=64, out_features=196, bias=True)
      (encoder): ElectraEncoder(
        (layer): ModuleList(
          (0-17): 18 x ElectraLayer(
            (attention): ElectraAttention(
              (self): ElectraSelfAttention(
                (query): Linear(in_features=196, out_features=196, bias=True)
                (key): Linear(in_features=196, out_features=196, bias=True)
                (value): Linear(in_features=196, out_features=196, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): E

### Helper Model Functions

#### Loss Function

In [8]:
def loss_fn(outputs, targets):
    return torch.nn.CrossEntropyLoss(label_smoothing=0.001)(outputs, targets)
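
PyTorch's `CrossEntropyLoss` applies log-softmax to its input, while both model classes above already return softmax probabilities. The sketch below works out, in plain Python and ignoring the small label-smoothing term, what range the loss can take when its inputs are probabilities for 7 classes. The helper `cross_entropy` is a hand-written stand-in for illustration, not the PyTorch implementation.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example: -log softmax(logits)[target]."""
    log_norm = math.log(sum(math.exp(v) for v in logits))
    return log_norm - logits[target]


num_classes = 7
# Inputs are probability vectors, as returned by the models above
confident_right = [1.0] + [0.0] * (num_classes - 1)
uniform = [1.0 / num_classes] * num_classes
confident_wrong = [0.0, 1.0] + [0.0] * (num_classes - 2)

best = cross_entropy(confident_right, target=0)
chance = cross_entropy(uniform, target=0)
worst = cross_entropy(confident_wrong, target=0)
print(round(best, 3), round(chance, 3), round(worst, 3))  # 1.165 1.946 2.165
```

Under this setup the loss cannot drop below roughly 1.17 even for a perfect prediction, which is consistent with the losses reported in this notebook staying between about 1.3 and 1.9. Returning raw logits from the model is the usual way to avoid this floor.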

#### Train Function

In [9]:
def train(model, training_loader, optimizer):
    model.train()
    for data in tqdm(training_loader):
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
        # CrossEntropyLoss expects class indices as integer (long) targets
        targets = data['targets'].to(device, dtype = torch.long)

        outputs = model(ids, mask, token_type_ids)
        loss = loss_fn(outputs, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Loss of the final batch only, as a Python float
    return loss.item()
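
The value returned by `train` is the loss of the last batch only, which can be noisy. One possible alternative, sketched here under the assumption that a per-epoch average is wanted, is to accumulate a running mean weighted by batch size. `RunningMean` is a hypothetical helper and the batch losses are made-up numbers.

```python
class RunningMean:
    """Accumulate a mean weighted by the number of examples per batch."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        # value is the mean loss of a batch containing n examples
        self.total += value * n
        self.count += n

    @property
    def mean(self):
        return self.total / self.count if self.count else 0.0


meter = RunningMean()
for batch_loss, batch_size in [(2.0, 16), (1.5, 16), (1.0, 8)]:
    meter.update(batch_loss, batch_size)
print(meter.mean)  # 1.6
```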

#### Validation

In [10]:
def validation(model, testing_loader):
    model.eval()
    fin_targets=[]
    fin_outputs=[]
    with torch.no_grad():
        for data in tqdm(testing_loader):
            targets = data['targets']
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
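            outputs = model(ids, mask, token_type_ids)
            # The model already returns softmax probabilities; this second softmax
            # rescales them but leaves the argmax (the predicted class) unchanged
            outputs = torch.softmax(outputs, dim=1).cpu().detach()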
            fin_outputs.extend(outputs)
            fin_targets.extend(targets)
    return torch.stack(fin_outputs), torch.stack(fin_targets)
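
The stacked outputs are later reduced to predictions with an argmax and scored with `accuracy_score`. As a framework-free sketch of that step, the hypothetical `accuracy` function below does the same on plain lists, using made-up probabilities for a 3-class problem.

```python
def accuracy(probabilities, targets):
    """Fraction of rows whose highest-probability class equals the target."""
    correct = 0
    for row, target in zip(probabilities, targets):
        predicted = max(range(len(row)), key=lambda i: row[i])
        correct += int(predicted == target)
    return correct / len(targets)


# Made-up class probabilities for four examples
probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.5, 0.4, 0.1],
]
targets = [0, 1, 1, 0]
print(accuracy(probabilities, targets))  # 0.75
```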

#### Tokenizer

In [41]:
# For our Base Bert Model
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# For our Electra model
tokenizer = ElectraTokenizerFast.from_pretrained('google/electra-base-discriminator')

#### Training Setup

### Model Evaluation

The following hyperparameters were chosen by trial and error, keeping the values that gave the best accuracy.

In [96]:
MAX_LEN = 64
BATCH_SIZE = 16  # also tried 8 and 32 (see Model Analysis)
EPOCHS = 3
NUM_OUT = 7
LEARNING_RATE = 1e-05

training_data = MultiLabelDataset(train_X, torch.from_numpy(train_y), tokenizer, MAX_LEN)
test_data = MultiLabelDataset(test_X, torch.from_numpy(test_y), tokenizer, MAX_LEN)

train_params = {'batch_size': BATCH_SIZE,
                'shuffle': False,
                'num_workers': 0
                }

test_params = {'batch_size': BATCH_SIZE,
                'shuffle': False,
                'num_workers': 0
                }    

training_loader = torch.utils.data.DataLoader(training_data, **train_params)
testing_loader = torch.utils.data.DataLoader(test_data, **test_params)
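
The number of batches each loader yields follows from the dataset size and `BATCH_SIZE`. The short check below, with a hypothetical helper `num_batches`, assumes the final partial batch is kept (the `DataLoader` default) and reproduces the counts shown in the progress bars of the training run.

```python
import math


def num_batches(num_examples, batch_size):
    """Batches per epoch when the final partial batch is kept."""
    return math.ceil(num_examples / batch_size)


print(num_batches(46000, 16))  # 2875 training batches
print(num_batches(985, 16))    # 62 test batches
```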

In [97]:
# model = BERTClass(NUM_OUT)
model = ELECTRAClass(NUM_OUT)
model.to(device)    

optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

for epoch in range(EPOCHS):
    loss = train(model, training_loader, optimizer)
    print(f'Epoch: {epoch + 1}, Loss:  {loss}')  
    guess, targs = validation(model, testing_loader)
    guesses = torch.argmax(guess, dim=1)
    print('accuracy on test set {}'.format(accuracy_score(y_pred=guesses, y_true=targs)))

100%|███████████████████████████████████████| 2875/2875 [06:52<00:00,  6.97it/s]

Epoch: 1, Loss:  1.8762006759643555

100%|███████████████████████████████████████████| 62/62 [00:01<00:00, 38.13it/s]

accuracy on test set 0.1736040609137056

100%|███████████████████████████████████████| 2875/2875 [06:50<00:00,  7.00it/s]

Epoch: 2, Loss:  1.663584589958191

100%|███████████████████████████████████████████| 62/62 [00:01<00:00, 40.61it/s]

accuracy on test set 0.38578680203045684

100%|███████████████████████████████████████| 2875/2875 [06:54<00:00,  6.94it/s]

Epoch: 3, Loss:  1.5776500701904297

100%|███████████████████████████████████████████| 62/62 [00:01<00:00, 39.88it/s]

accuracy on test set 0.515736040609137

## End Evaluation

### Model Analysis

#### Performance on batch size 8

**EPOCH 1**

BERT 75% Accuracy | Loss 1.39

ELECTRA 57% | Loss 1.49

**EPOCH 2**

BERT 78% | Loss 1.29

ELECTRA 69% | Loss 1.39

**EPOCH 3**

BERT 77% | Loss 1.29

ELECTRA 70%  | Loss 1.32

#### Performance on batch size 16

**EPOCH 1**

BERT 77% 

ELECTRA 17% | Loss 1.87

**EPOCH 2**

BERT 77%

ELECTRA 38% | Loss 1.66

**EPOCH 3**

BERT 78%

ELECTRA 52% | Loss 1.58


#### Performance on batch size 32

**EPOCH 1**

BERT 75% | Loss 1.4

TinyBERT 66% | Loss 1.41

ELECTRA 17% | Loss 1.9

**EPOCH 2**

BERT 78% | Loss 1.39

TinyBERT 67% | Loss 1.37

ELECTRA 17% | Loss 1.8

**EPOCH 3**

BERT 78% | Loss 1.28

TinyBERT 70% | Loss 1.41

ELECTRA 17% | Loss 1.7

#### Summary of Batch Size for BERT

**Batch Size Increase for BERT Model**

With a batch size of *8* the model:

- Improved from 75% to 78% accuracy after the first epoch
- Dipped slightly to 77% in the third epoch

With a batch size of *16* the model:

- Improved gradually, from 77% to 78%, without any decrease between epochs
- Reached similar accuracy to batch size 8 but was more consistent

With a batch size of *32* the model:

- Followed a pattern similar to batch size 8, with a jump from 75% to 78% after the first epoch
- Held that 78% through the final epoch, ending slightly above batch size 8

#### Summary of ELECTRA Model

**Batch Size Increase for ELECTRA Model**

With a batch size of *8* the model:

- Performed significantly better than it did with the larger batch sizes
- Showed diminishing gains: accuracy rose sharply from epoch 1 to epoch 2 (57% to 69%) but only slightly from epoch 2 to epoch 3 (69% to 70%)

With a batch size of *16* the model:

- Improved substantially from epoch to epoch (17%, 38%, 52%)
- Started at the same accuracy as batch size 32 but, unlike it, kept improving

With a batch size of *32* the model:

- Stayed at 17% accuracy in every epoch
- Showed a decreasing loss (1.9 to 1.7) that did not translate into any accuracy gain

#### BERT vs ELECTRA 

BERT was significantly more consistent across batch sizes, and its accuracy was higher than ELECTRA's at every batch size, with the widest gaps at batch sizes 16 and 32. At batch size 8, ELECTRA narrowed the gap (70% versus 77% after three epochs) but still did not surpass BERT.
ELECTRA was much faster to fine-tune and showed larger accuracy gains from epoch to epoch, but it still did not perform as well on the task as BERT.
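
To put numbers on this comparison, the sketch below collects the final-epoch accuracies reported in the performance sections above and computes BERT's lead at each batch size. The dictionary layout and variable names are illustrative only.

```python
# Final-epoch accuracy (percent) per batch size, taken from the results above
final_accuracy = {
    8: {"BERT": 77, "ELECTRA": 70},
    16: {"BERT": 78, "ELECTRA": 52},
    32: {"BERT": 78, "ELECTRA": 17},
}

# Percentage-point lead of BERT over ELECTRA at each batch size
gaps = {
    batch_size: scores["BERT"] - scores["ELECTRA"]
    for batch_size, scores in final_accuracy.items()
}

for batch_size, gap in gaps.items():
    print(f"batch size {batch_size}: BERT leads by {gap} points")
```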

#### Computation with Fine-Tuning vs Pre-Training

Pre-training takes significantly more compute time than fine-tuning because of the following factors:

- Pre-Training:

    - Pre-training starts from scratch: we begin with random weights (parameters) and train the model to predict masked words. Every pass through the model updates a large number of parameters, and word-prediction accuracy improves only gradually.

    - Because the model has no prior knowledge to draw on when guessing these words, it needs a significant amount of data. Each pass over the dataset gives it more context for making better predictions.

    - A good analogy is starting preschool. Children have little prior knowledge of academic concepts, so they learn incrementally and slowly build a solid foundation in the general context of academia.

- Fine-Tuning:

    - Fine-tuning takes an existing model and adapts it to improve its knowledge in a specific area. Since the model already has a solid foundation, each parameter needs less adjustment to achieve better accuracy.

    - With more initial context, less data is needed to improve accuracy on a particular task, which reduces the overall amount of data processing.

    - A good analogy is starting college. High school graduates have decent prior knowledge of academic concepts and can build on it to develop a solid foundation in a specialized area.
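
The masked-word objective mentioned under pre-training can be sketched in a few lines. This is a simplified illustration and not the pre-training code used for the model: the function `mask_tokens`, the masking probability, and the example sentence are all assumptions, and real pipelines operate on token ids rather than whole words.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Hide a random subset of tokens and record what the model must predict."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)   # the model is trained to recover this word
        else:
            masked.append(token)
            labels.append(None)    # no prediction needed at this position
    return masked, labels


tokens = "the model learns language by predicting hidden words".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(labels)
```

Fine-tuning skips this step entirely: it starts from the weights this objective produced and trains only on the labeled classification examples.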