<a href="https://colab.research.google.com/github/Parisa-Foroutan/Deep-Learning/blob/main/Scientific_paper_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Kaggle Competition

Author: Parisa Foroutan
For this task, I fine-tuned pretrained models from the Hugging Face library. I tested BertForSequenceClassification, RobertaForSequenceClassification, XLNetForSequenceClassification, AlbertForSequenceClassification, and DebertaForSequenceClassification. Among these, the microsoft/deberta-base tokenizer and pretrained model gave the highest accuracy, and the next best model was bert-base-uncased. I also tried large models such as BERT-large and DeBERTa-large, but they could not be run because of CUDA memory limitations.

I used a single input sequence formed by concatenating the title and the abstract, and cleaned the data a little further to get rid of some rare tokens. For validation, I held out 10% of the training dataset.

Moreover, I tried various hyperparameter settings by changing MAX_LEN (the maximum sequence length), batch_size, the learning rate, the gradient clipping value, hidden_dropout_prob, and attention_probs_dropout_prob.

In [None]:
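# Install the Hugging Face library
!pip install transformers
# for using XLNet
!pip install SentencePiece

# Clear the pip output (Colab only); `output` must be imported first
from google.colab import output
output.clear()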

## Import general libraries

In [None]:
import os
import math

import torch
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.cuda.empty_cache()

## Loading Dataset


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
df = pd.read_csv('/content/drive/MyDrive/Datasets/HW2_DL_document_classification/my_train.csv', header=0, encoding='utf-8')
# df = pd.read_csv('train.csv', header=0, encoding='utf-8')
# to use both title and abstract:
df['text'] = df.iloc[:, 3] + " " + df.iloc[:, 4]
df = df.drop(df.columns[[1, 2, 3, 4]], axis=1)

print(df.shape[0])
df.sample(3)

60000


Unnamed: 0,label,text
23550,16,centernet keypoint triplets for object detecti...
35905,14,symbolic versus numerical computation and visu...
45182,6,living without a mobile phone an autoethnograp...


In [None]:
# prepare the list of sentences and labels
sentences = df.text.values
labels = df.label.values

In [None]:
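# function to clean up the data
def clean_sequence(sent):
    # Replace @mentions with a space
    sent = re.sub(r"@[A-Za-z0-9]+", ' ', sent)
    # NOTE: the unescaped `$` is an end-of-string anchor, so this pattern
    # never matches and the line is a no-op; it is kept so that the
    # preprocessing stays unchanged
    sent = re.sub(r"$\?[A-Za-z0-9]+", '$', sent)
    # Replace URL links with a space
    sent = re.sub(r"https?://[A-Za-z0-9./]+", ' ', sent)
    # Keep only letters, digits, and the characters . ( ) ! ? ' % $
    sent = re.sub(r"[^a-zA-Z0-9.()!?'%$]", ' ', sent)
    # Collapse runs of spaces into one
    sent = re.sub(r" +", ' ', sent)
    return sent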
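
As a quick sanity check, the sketch below applies the same substitutions to a made-up string (the sample text is an assumption, not taken from the dataset). The `$` substitution is left out because, with an unescaped `$`, that pattern cannot match anything, so the result is the same.

```python
import re

def clean_demo(sent):
    """Compact copy of the substitutions in clean_sequence, for illustration only."""
    sent = re.sub(r"@[A-Za-z0-9]+", ' ', sent)             # @mentions
    sent = re.sub(r"https?://[A-Za-z0-9./]+", ' ', sent)   # URLs
    sent = re.sub(r"[^a-zA-Z0-9.()!?'%$]", ' ', sent)      # all other symbols
    return re.sub(r" +", ' ', sent)                        # repeated spaces

# Hypothetical input string
sample = "See https://arxiv.org/abs/1234 by @user42: BERT-based models (2019)!"
print(clean_demo(sample))
# See by BERT based models (2019)!
```

Note that hyphens and colons become spaces, so "BERT-based" is split into two words before tokenization.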

In [None]:
sentences = [clean_sequence(sent) for sent in sentences]

## Tokenization & Data Preprocessing

In [None]:
from transformers import BertTokenizer, RobertaTokenizer, XLNetTokenizer, AlbertTokenizer, DebertaTokenizer

tokenizer = DebertaTokenizer.from_pretrained('microsoft/deberta-base', do_lower_case=True)
# tokenizer = AlbertTokenizer.from_pretrained('albert-base-v2', do_lower_case=True)
# tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=True)
# tokenizer = RobertaTokenizer.from_pretrained('roberta-base', do_lower_case=True) 
# tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased', do_lower_case=True)

### Sentence Length & Attention Mask

In [None]:
# Tokenize all of the sentences
input_ids = []

for sent in sentences:
    encoded_sent = tokenizer.encode(sent, add_special_tokens = True)
    input_ids.append(encoded_sent)

print('Sentence: ', sentences[0])
print('Token IDs: ', input_ids[0])

Token indices sequence length is longer than the specified maximum sequence length for this model (731 > 512). Running this sequence through the model will result in indexing errors


Sentence:  evasion attacks against machine learning at test time In security sensitive applications the success of machine learning depends on a thorough vetting of their resistance to adversarial data. In one pertinent well motivated attack scenario an adversary may attempt to evade a deployed system at test time by carefully manipulating attack samples. In this work we present a simple but effective gradient based approach that can be exploited to systematically assess the security of several widely used classification algorithms against evasion attacks. Following a recently proposed framework for security evaluation we simulate attack scenarios that exhibit different risk levels for the classifier by increasing the attacker's knowledge of the system and her ability to manipulate attack samples. This gives the classifier designer a better picture of the classifier performance under evasion attacks and allows him to perform a more informed model selection (or parameter setting). We ev

### Padding & Truncating

In [None]:
print('Max sentence length: ', max([len(sen) for sen in input_ids]))

Max sentence length:  1009


In [None]:
from keras.preprocessing.sequence import pad_sequences
# The pretrained models accept at most 512 tokens, but some sequences here are longer, so they have to be truncated.
# In some cases, even 512 is not possible due to CUDA memory limitations.
MAX_LEN = 450
# Pad or truncate all sentences to MAX_LEN tokens
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", 
                          value=0, truncating="post", padding="post")
print(f'\nPadding token: {tokenizer.pad_token} , ID: {tokenizer.pad_token_id}')


Padding token: [PAD] , ID: 0


**Attention Masks:**

The attention mask will distinguish the real tokens from padding tokens.

In [None]:
attention_masks = []

for sent in input_ids:
    att_mask = [int(token_id > 0) for token_id in sent]
    attention_masks.append(att_mask)
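
To make the padding and masking steps concrete, here is a minimal pure-Python sketch on toy token IDs. It assumes a pad ID of 0 (as printed above for this tokenizer) and "post" truncation and padding; the helper name `pad_and_mask` is hypothetical.

```python
PAD_ID = 0  # assumption: matches the pad ID printed for this tokenizer

def pad_and_mask(ids, max_len, pad_id=PAD_ID):
    """Truncate or pad a list of token IDs to max_len and build its attention mask."""
    ids = ids[:max_len]                             # "post" truncation
    n_real = len(ids)
    padded = ids + [pad_id] * (max_len - n_real)    # "post" padding
    mask = [1] * n_real + [0] * (max_len - n_real)  # 1 = real token, 0 = padding
    return padded, mask

padded, mask = pad_and_mask([1, 7, 42, 2], max_len=6)
print(padded)  # [1, 7, 42, 2, 0, 0]
print(mask)    # [1, 1, 1, 1, 0, 0]
```

The `token_id > 0` test used in the notebook gives the same mask only when the pad ID is 0; a tokenizer with a different pad ID would need a comparison against `tokenizer.pad_token_id` instead.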

### Split Data into Training and Validation

In [None]:
from sklearn.model_selection import train_test_split
# Use 90% for training and 10% for validation.
train_inputs, valid_inputs, train_labels, valid_labels = train_test_split(input_ids, labels, 
                                                            random_state=12, test_size=0.10)

train_masks, valid_masks, _, _ = train_test_split(attention_masks, labels,
                                             random_state=12, test_size=0.10)

# Convert all inputs and labels into torch tensors
train_inputs = torch.tensor(train_inputs)
valid_inputs = torch.tensor(valid_inputs)
train_masks = torch.tensor(train_masks)
valid_masks = torch.tensor(valid_masks)
train_labels = torch.tensor(train_labels)
valid_labels = torch.tensor(valid_labels)

In [None]:
# Creating DataLoader objects
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
batch_size = 6

# DataLoader for training
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
# DataLoader for validation
valid_data = TensorDataset(valid_inputs, valid_masks, valid_labels)
valid_sampler = SequentialSampler(valid_data)
valid_dataloader = DataLoader(valid_data, sampler=valid_sampler, batch_size=batch_size)

## Train the Text Classification Model

### Pretrained Models from Hugging Face


I have used pretrained models (such as BertForSequenceClassification) from the Hugging Face library. 

In [None]:
from transformers import BertForSequenceClassification, AdamW, BertConfig, RobertaForSequenceClassification, \
                         RobertaConfig, XLNetForSequenceClassification, XLNetConfig, AlbertForSequenceClassification,\
                         DebertaForSequenceClassification


# config = RobertaConfig.from_pretrained('roberta-base', num_labels = 20)
# config.hidden_dropout_prob  = 0.15
# config.attention_probs_dropout_prob = 0.15

# model = BertForSequenceClassification.from_pretrained("bert-base-uncased",  num_labels = 20)
model = DebertaForSequenceClassification.from_pretrained("microsoft/deberta-base",  num_labels = 20)
# model = AlbertForSequenceClassification.from_pretrained("albert-base-v2",  num_labels = 20)
# model = RobertaForSequenceClassification.from_pretrained("roberta-base", config = config)
# model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", config = config)

# Move the model to the device selected earlier (GPU if available)
model.to(device)

Some weights of the model checkpoint at microsoft/deberta-base were not used when initializing DebertaForSequenceClassification: ['lm_predictions.lm_head.bias', 'lm_predictions.lm_head.dense.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'config']
- This IS expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DebertaForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-base and are newly initialized: ['pooler.dense.weight

DebertaForSequenceClassification(
  (deberta): DebertaModel(
    (embeddings): DebertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=0)
      (LayerNorm): DebertaLayerNorm()
      (dropout): StableDropout()
    )
    (encoder): DebertaEncoder(
      (layer): ModuleList(
        (0): DebertaLayer(
          (attention): DebertaAttention(
            (self): DisentangledSelfAttention(
              (in_proj): Linear(in_features=768, out_features=2304, bias=False)
              (pos_dropout): StableDropout()
              (pos_proj): Linear(in_features=768, out_features=768, bias=False)
              (pos_q_proj): Linear(in_features=768, out_features=768, bias=True)
              (dropout): StableDropout()
            )
            (output): DebertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): DebertaLayerNorm()
              (dropout): StableDropout()
            )
          )
          (intermed

###  Optimizer & Learning Rate Scheduler

In [None]:
from transformers import get_linear_schedule_with_warmup
epochs = 4

# AdamW is a class from the Hugging Face library (Adam with the "weight decay fix")
optimizer = AdamW(model.parameters(), lr = 2e-5, eps = 1e-8)
# Total number of training steps
total_steps = len(train_dataloader) * epochs
# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps = total_steps)
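
The effect of this schedule can be sketched in plain Python. The function below is my own re-statement of a linear warmup and decay multiplier, not the library source, and the step count assumes 54,000 training examples (90% of 60,000) with a batch size of 6.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = (54000 // 6) * 4  # batches per epoch times epochs = 36,000

for step in (0, total_steps // 2, total_steps):
    lr = base_lr * linear_schedule_factor(step, 0, total_steps)
    print(f"step {step:>6}: lr = {lr:.2e}")
```

With `num_warmup_steps=0` the learning rate starts at its full value and falls linearly, reaching half the base rate at the midpoint and zero at the final step.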

### Training Loop

In [None]:
# Function to calculate the accuracy of our predictions vs labels
def accuracy_score(preds, labels):
    predicted = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(predicted == labels_flat) / len(labels_flat)

# function for formatting elapsed times
import time
import datetime
def format_time(elapsed):
    '''
    report time in hh:mm:ss
    '''
    elapsed_rounded = int(round((elapsed)))
    return str(datetime.timedelta(seconds=elapsed_rounded))
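
A small check of both helpers on toy values may help when reading the training logs. The logits and labels below are made up for illustration, and the functions are renamed copies so the block runs on its own.

```python
import datetime
import numpy as np

def accuracy_demo(preds, labels):
    """Fraction of rows whose arg-max class equals the label."""
    predicted = np.argmax(preds, axis=1).flatten()
    return np.sum(predicted == labels.flatten()) / len(labels.flatten())

def format_time_demo(elapsed):
    """Seconds -> h:mm:ss string."""
    return str(datetime.timedelta(seconds=int(round(elapsed))))

# Toy batch: 4 examples, 3 classes (hypothetical values)
toy_logits = np.array([[0.1, 2.0, -1.0],
                       [1.5, 0.2, 0.3],
                       [0.0, 0.1, 3.0],
                       [2.0, 1.0, 0.0]])
toy_labels = np.array([1, 0, 2, 1])

print(accuracy_demo(toy_logits, toy_labels))  # 0.75
print(format_time_demo(3725.4))               # 1:02:05
```

The last example is predicted as class 0 but labelled 1, which is why three of the four predictions count as correct.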

I couldn't use large models such as BERT-large or DeBERTa-large because of GPU memory limitations. For example:
```
bert-large-uncased:  RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 14.76 GiB total capacity; 13.59 GiB already allocated; 125.75 MiB free; 13.61 GiB reserved in total by PyTorch)
```
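
A rough back-of-the-envelope estimate suggests why the large models do not fit. The parameter counts below are approximate public figures (about 110M for bert-base and 340M for bert-large), and the sketch assumes fp32 training with AdamW, which keeps two extra buffers per parameter. Activation memory, which grows with batch size and sequence length, is not included.

```python
def training_state_gib(n_params, bytes_per_value=4):
    """Approximate fp32 memory for weights, gradients, and two AdamW moment buffers."""
    copies = 4  # weights + gradients + first moment + second moment
    return n_params * bytes_per_value * copies / 2**30

# Approximate parameter counts (assumptions, not measured here)
for name, n_params in [("bert-base", 110e6), ("bert-large", 340e6)]:
    print(f"{name}: ~{training_state_gib(n_params):.1f} GiB before activations")
```

Under these assumptions the large model needs roughly three times as much memory as the base model before any activations are stored, which leaves far less of the 14.76 GiB card for sequences of 450 tokens.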

In [None]:
# average loss after each epoch
loss_values = []

for epoch in range(0, epochs):

    # checkpoint = torch.load('/content/drive/MyDrive/Datasets/HW2_DL_document_classification/checkpoint/model.pt')
    # model.load_state_dict(checkpoint['model_state_dict'])
    # optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    # epoch = checkpoint['epoch'] + 1
    # loss = checkpoint['loss']

    ## Training ##
    print(f'====== Epoch {epoch + 1} / {epochs} ======')
    print('Training...')
 
    t0 = time.time()
    # Reset the total loss for this epoch.
    total_loss = 0

    model.train()
    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):

        if step % 500 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # batch: (input ids, attention masks, labels)  
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        
        model.zero_grad()        
        
        outputs = model(b_input_ids, 
                    token_type_ids=None, 
                    attention_mask=b_input_mask, 
                    labels=b_labels)
        
        loss = outputs[0]
        # Accumulate the training loss over all of the batches 
        total_loss += loss.item()
        # backward pass
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 2.0)
        optimizer.step()
        scheduler.step()
    # Save a checkpoint at the end of each epoch
    torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'loss': loss.item(),
            }, '/content/drive/MyDrive/Datasets/HW2_DL_document_classification/checkpoint/model.pt')
    # Calculate the average loss over the training data.
    # total_loss is already a Python float (accumulated via loss.item()),
    # so no .detach()/.cpu() call is needed.
    avg_train_loss = total_loss / len(train_dataloader)
    
    # Store the loss value for plotting the learning curve.
    loss_values.append(avg_train_loss)
    print("")

    print(f"  Average training loss: {avg_train_loss}")
    print(f"  Training time for this epoch: {format_time(time.time() - t0)}")
        
    ## Validation ##

    print("\nValidation...")
    t0 = time.time()
    valid_loss, valid_accuracy, best_valid_acc = 0, 0, 0
    nb_valid_steps, nb_valid_examples = 0, 0

    model.eval()

    for batch in valid_dataloader:
        
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        
        with torch.no_grad():        

            outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
        
        logits = outputs[0]
        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        
        # batch accuracy
        tmp_valid_accuracy = accuracy_score(logits, label_ids)  
        # update total accuracy.
        valid_accuracy += tmp_valid_accuracy
        nb_valid_steps += 1

    # Report validation accuracy.
    print(f"  Validation Accuracy: {valid_accuracy/nb_valid_steps}")
    print(f"  Validation time: {format_time(time.time() - t0)}")
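
The prediction cell further down contains a commented-out `model.load_state_dict(best_state_dict)`, while the validation code resets `best_valid_acc` to 0 in every epoch, so no best model is actually tracked. One possible way to track it is sketched below. The per-epoch accuracies and the plain dicts standing in for `model.state_dict()` are hypothetical.

```python
import copy

# Hypothetical per-epoch results: (validation accuracy, model state).
# In the notebook these would be valid_accuracy / nb_valid_steps and
# model.state_dict(); plain dicts stand in for the state here.
epoch_results = [
    (0.71, {"weights": "epoch-1"}),
    (0.78, {"weights": "epoch-2"}),
    (0.76, {"weights": "epoch-3"}),
]

best_valid_acc = 0.0     # initialised once, outside the epoch loop
best_state_dict = None

for acc, state in epoch_results:
    if acc > best_valid_acc:
        best_valid_acc = acc
        # deepcopy so that later training steps cannot modify the saved copy
        best_state_dict = copy.deepcopy(state)

print(best_valid_acc, best_state_dict)
```

The key points are that the running best is initialised before the loop and that the state is copied, since `state_dict()` returns references to the live parameters.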

In [None]:
fig = plt.figure(figsize=(6,3))
plt.plot(range(epochs), loss_values, label='loss')
plt.title('Training loss of the Model')
plt.xlabel('epochs')
plt.ylabel('loss')

## Performance On Test Set

### Data Preparation

In [None]:
df = pd.read_csv('/content/drive/MyDrive/Datasets/HW2_DL_document_classification/my_test.csv', header=0, encoding='utf-8')
# df = pd.read_csv('my_test.csv', header=0, encoding='utf-8')
df['text'] = df.iloc[:, 2] + " " + df.iloc[:, 3]
print('Number of papers in test set: {:,}\n'.format(df.shape[0]))
print(df.sample(3))
sentences = df.text.values

Number of papers in test set: 13,718

       node id  ...                                               text
4691    148592  ...  modular tracking framework a unified approach ...
11752   164862  ...  cartesian effect categories are freyd categori...
7277    154568  ...  baby step giant step algorithms for the symmet...

[3 rows x 5 columns]


In [None]:
sentences = [clean_sequence(sent) for sent in sentences]

In [None]:
input_ids = []

for sent in sentences:
    encoded_sent = tokenizer.encode(sent, add_special_tokens = True)
    
    input_ids.append(encoded_sent)
# Pad input tokens
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, 
                          dtype="long", truncating="post", padding="post")

attention_masks = []
# Create attention masks (1 for real tokens, 0 for padding), as for the training data
for seq in input_ids:
    seq_mask = [int(i > 0) for i in seq]
    attention_masks.append(seq_mask)

prediction_inputs = torch.tensor(input_ids)
prediction_masks = torch.tensor(attention_masks)

batch_size = 1 
# DataLoader
prediction_data = TensorDataset(prediction_inputs, prediction_masks)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)

### Evaluate on Test Set

In [None]:
# Prediction on test set
print(f'Predicting labels for {len(prediction_inputs)} test sentences...')

# model.load_state_dict(best_state_dict)

model.eval()

predictions = []

for test_batch in prediction_dataloader:
  # Add test_batch to GPU
  test_batch = tuple(t.to(device) for t in test_batch)
  b_input_ids, b_input_mask = test_batch
  
  with torch.no_grad():
      outputs = model(b_input_ids, token_type_ids=None, 
                      attention_mask=b_input_mask)
  logits = outputs[0]

  # Move logits to CPU (there are no labels for the test set)
  logits = logits.detach().cpu().numpy()
  predictions.append(logits)

Predicting labels for 13718 test sentences...


### Saving my predictions

In [None]:
my_labels = pd.DataFrame(np.argmax(predictions, axis=2))
my_labels['node id'] = df['node id']
my_labels.columns = ['label', 'node id']
my_labels = my_labels[['node id', 'label']]
my_labels.to_csv('sorted_predictions_bert_clean_450.csv')
my_labels.to_csv('/content/drive/MyDrive/Datasets/HW2_DL_document_classification/sorted_predictions_bert_clean_450.csv')
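
The `axis=2` in the arg-max relies on `batch_size = 1`: each entry of `predictions` then has shape `(1, num_labels)`, so the list stacks into a 3-D array. The sketch below checks this on random toy logits and shows a variant that should also work for other batch sizes. The array contents are made up for illustration.

```python
import numpy as np

num_labels = 20
rng = np.random.default_rng(0)

# Toy stand-in for `predictions`: 5 batches, each of size 1
toy_predictions = [rng.normal(size=(1, num_labels)) for _ in range(5)]

# What the notebook does: the list becomes an array of shape (5, 1, 20)
stacked = np.argmax(toy_predictions, axis=2)
print(stacked.shape)  # (5, 1)

# Batch-size-agnostic variant: join the batches first, then take the arg-max
flat = np.argmax(np.concatenate(toy_predictions, axis=0), axis=1)
print(flat.shape)     # (5,)

# Both give the same labels when every batch has size 1
print(bool((stacked.ravel() == flat).all()))  # True
```

With a larger prediction batch size, the last batch is usually smaller than the others, so the list no longer stacks into a regular array and the concatenate form is the safer choice.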