## Abstract

To build the datasets, I applied some basic operations to the provided CSV files and created two datasets named "TrainData" and "TestData". I used pre-trained models via transfer learning; because most data pre-processing is handled inside these models, the only pre-processing I applied was removing punctuation.

---

Below is a summary of my two best submission scores on [this Kaggle public leaderboard](https://www.kaggle.com/competitions/math80600aw21/leaderboard?tab=public). Both pre-trained models were fine-tuned. This document contains only the code for the first model ("roberta-large"), which gave the best submission score (test accuracy).



|   Model       |Max_Len|batch_size| lr |eps |epochs|avg train loss|validation accuracy|submission score|
|---------------|-------|----------|----|----|------|--------------|-------------------|----------------|
| roberta_large |  300  |     10   |1e-5|1e-6|  4   |     0.23     |     0.8502        |     0.85982    |
| deberta_large |  256  |     6    |1e-5|1e-8|  4   |     0.18     |     0.8528        |     0.85784    |




In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
## Set hyper-parameters
## These values gave my best results within the limits of the available CUDA memory.
seed_val = 67
Max_Len = 300
batch_size = 10
epochs = 4
lr = 1e-5
eps = 1e-6

## Loading the datasets

In [None]:
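## Read "TrainData.csv" from the respective folder on your Google Drive
import pandas as pd
## on_bad_lines="skip" replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in pandas 2.0
df = pd.read_csv("/content/drive/MyDrive/Kaggle-Bahareh/TrainData.csv", on_bad_lines="skip")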
df

Unnamed: 0,label,nodeid,paperid,title,abstract
0,4,0,9657784,evasion attacks against machine learning at te...,"In security-sensitive applications, the succes..."
1,5,1,39886162,how hard is computing parity with noisy commun...,We show a tight lower bound of $\Omega(N \log\...
2,8,3,121432379,a promise theory perspective on data networks,Networking is undergoing a transformation thro...
3,6,6,1444859417,webvrgis based city bigdata 3d visualization a...,This paper shows the WEBVRGIS platform overlyi...
4,4,7,1483430697,information theoretic authentication and secre...,"In the splitting model, information theoretic ..."
...,...,...,...,...,...
59995,5,137822,2342249457,incentivizing users of data centers participat...,Demand response is widely employed by today’...
59996,16,137823,2343427588,semantic change detection with hypermaps,Change detection is the study of detecting cha...
59997,9,137827,2347853400,computing with polynomial ordinary differentia...,"In 1941, Claude Shannon introduced the General..."
59998,8,137830,2399648051,on energy efficiency in wireless networks a ga...,We develop a game-theoretic framework to inves...


In [None]:
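## Read "TestData.csv" from the respective folder on your Google Drive
## on_bad_lines="skip" replaces error_bad_lines=False, which was deprecated in pandas 1.3 and removed in pandas 2.0
df2 = pd.read_csv("/content/drive/MyDrive/Kaggle-Bahareh/TestData.csv", on_bad_lines="skip")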
df2

Unnamed: 0,nodeid,paperid,title,abstract
0,137832,2403725649,patchlift fast and exact computation of patch ...,"In this paper, we propose a fast algorithm cal..."
1,137833,2404740077,the unreasonable effectiveness of address clus...,Address clustering tries to construct the one-...
2,137834,2407125866,end to end goal driven web navigation,We propose a goal-driven web navigation as a b...
3,137836,2408327416,complexity measures for map reduce and compari...,The programming paradigm Map-Reduce and its ma...
4,137837,2412021890,a parallel implementation of the ensemble kalm...,This paper discusses an efficient parallel imp...
...,...,...,...,...
13713,169336,3011349285,confidence guided stereo 3d object detection w...,Accurate and reliable 3D object detection is v...
13714,169338,3011696425,sentinet detecting localized universal attacks...,SentiNet is a novel detection framework for lo...
13715,169340,3011798063,learning compositional rules via neural progra...,"Many aspects of human reasoning, including lan..."
13716,169341,3012226457,certified defenses for adversarial patches,Adversarial patch attacks are among one of the...


## Data PreProcessing

In [None]:
## TrainData
## concatenation of "title" and "abstract" columns and removing other columns except for "label" column
df['text'] = df.iloc[:, 3] + " " + df.iloc[:, 4]
df = df.drop(df.columns[[1,2,3,4]], axis=1)
df = df[['text', 'label']]
df

Unnamed: 0,text,label
0,evasion attacks against machine learning at te...,4
1,how hard is computing parity with noisy commun...,5
2,a promise theory perspective on data networks ...,8
3,webvrgis based city bigdata 3d visualization a...,6
4,information theoretic authentication and secre...,4
...,...,...
59995,incentivizing users of data centers participat...,5
59996,semantic change detection with hypermaps Chang...,16
59997,computing with polynomial ordinary differentia...,9
59998,on energy efficiency in wireless networks a ga...,8


In [None]:
## TestData 
## concatenation of "title" and "abstract" columns and removing other columns 
df2['text'] = df2.iloc[:, 2] + " " + df2.iloc[:, 3]
df2 = df2.drop(df2.columns[[0,1,2,3]], axis=1)
df2

Unnamed: 0,text
0,patchlift fast and exact computation of patch ...
1,the unreasonable effectiveness of address clus...
2,end to end goal driven web navigation We propo...
3,complexity measures for map reduce and compari...
4,a parallel implementation of the ensemble kalm...
...,...
13713,confidence guided stereo 3d object detection w...
13714,sentinet detecting localized universal attacks...
13715,learning compositional rules via neural progra...
13716,certified defenses for adversarial patches Adv...


In [None]:
## A function that replaces punctuation and other symbols with spaces (the straight apostrophe is kept)
import string

punct =[]
punct += list(string.punctuation)
punct += '’'
punct.remove("'")
def remove_punctuations(text):
    for punctuation in punct:
        text = text.replace(punctuation, ' ')
    return text
print(punct)

['!', '"', '#', '$', '%', '&', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '’']
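
As an aside, the same clean-up can be sketched with `str.translate`, which makes a single pass over the text instead of one `replace` call per symbol. The name `strip_punctuation` is hypothetical and not part of the notebook. Unlike the cells that follow, this sketch also collapses every run of whitespace and trims the ends, so its output is not byte-for-byte identical to the saved CSV files.

```python
import string

# Same character set as the cell above: ASCII punctuation plus the curly
# apostrophe, with the straight apostrophe kept.
PUNCT = (set(string.punctuation) | {"’"}) - {"'"}

# Translation table that maps every punctuation character to a space.
PUNCT_TABLE = str.maketrans({ch: " " for ch in PUNCT})


def strip_punctuation(text):
    """Replace punctuation with spaces, then collapse whitespace runs."""
    return " ".join(text.translate(PUNCT_TABLE).split())


print(strip_punctuation("gradient-based approach (see Sec. 2); attacker's knowledge"))
# gradient based approach see Sec 2 attacker's knowledge
```

With pandas this could be applied as `df['text'].apply(strip_punctuation)`.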


In [None]:
## Remove punctuation from the "text" column of TrainData

df['text'] = df['text'].apply(remove_punctuations)
df['text'] = df['text'].apply(lambda x: str(x).replace("  ", " "))
df.to_csv('TrainData_removed_punctuations.csv')

## Looking at an example
df['text'].iloc[0]

"evasion attacks against machine learning at test time In security sensitive applications the success of machine learning depends on a thorough vetting of their resistance to adversarial data In one pertinent well motivated attack scenario an adversary may attempt to evade a deployed system at test time by carefully manipulating attack samples In this work we present a simple but effective gradient based approach that can be exploited to systematically assess the security of several widely used classification algorithms against evasion attacks Following a recently proposed framework for security evaluation we simulate attack scenarios that exhibit different risk levels for the classifier by increasing the attacker's knowledge of the system and her ability to manipulate attack samples This gives the classifier designer a better picture of the classifier performance under evasion attacks and allows him to perform a more informed model selection or parameter setting  We evaluate our appro

In [None]:
## Remove punctuation from the "text" column of TestData

df2['text'] = df2['text'].apply(remove_punctuations)
df2['text'] = df2['text'].apply(lambda x: str(x).replace("  ", " "))
df2.to_csv('TestData_removed_punctuations.csv')

## Looking at an example
df2['text'].iloc[0]

'patchlift fast and exact computation of patch distances using lifting with applications to non local means In this paper we propose a fast algorithm called PatchLift for computing distances between patches extracted from a one dimensional signal PatchLift is based on the observation that the patch distances can be expressed in terms of simple moving sums of an image which is derived from the one dimensional signal via lifting We apply PatchLift to develop a separable extension of the classical Non Local Means NLM algorithm which is at least 100 times faster than NLM for standard parameter settings The PSNR obtained using the proposed extension is typically close to and often larger than the PSNRs obtained using the original NLM We provide some simulations results to demonstrate the acceleration achieved using separability and PatchLift '

## Model

In [None]:
import numpy as np
import pandas as pd
import torch
from google.colab import output

## install the transformers package from Hugging Face
!pip install transformers
output.clear() 

device = torch.device("cuda")
print(torch.cuda.get_device_name(0))

Tesla P100-PCIE-16GB


To fine-tune the pre-trained RoBERTa model for this text classification problem, each text must first be split into tokens with the tokenizer that ships with the model.
The tokenizer splits a text into subword tokens, adds the special start and end tokens (`<s>` and `</s>` in RoBERTa, the counterparts of BERT's [CLS] and [SEP]), and finally maps each token to its index in the tokenizer vocabulary.
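
The encoding step can be illustrated with a toy sketch. It assumes a tiny, made-up vocabulary and plain whitespace splitting, whereas the real RoBERTa tokenizer uses byte-level BPE and a vocabulary of roughly 50,000 entries. Only the four special-token ids (`<s>`=0, `<pad>`=1, `</s>`=2, `<unk>`=3) match RoBERTa. The word ids and the name `toy_encode` are hypothetical.

```python
# Toy vocabulary. The first four ids match RoBERTa's special tokens;
# the word entries are invented for this illustration.
TOY_VOCAB = {
    "<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3,
    "semantic": 4, "change": 5, "detection": 6,
}


def toy_encode(text):
    """Split on whitespace, wrap in <s> ... </s>, and look up each id."""
    tokens = ["<s>"] + text.lower().split() + ["</s>"]
    # Words missing from the vocabulary fall back to the <unk> id.
    return [TOY_VOCAB.get(tok, TOY_VOCAB["<unk>"]) for tok in tokens]


print(toy_encode("semantic change detection with hypermaps"))
# [0, 4, 5, 6, 3, 3, 2]
```

A real subword tokenizer rarely needs `<unk>`, because unseen words are broken into smaller known pieces, as the tokenized example in the cell output shows for "webvrgis".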


In [None]:
## Tokenize the texts and map them to their index

from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-large')

texts = df.text.values
labels = df.label.values

input_ids = []
for text in texts:
    encoded_text = tokenizer.encode(text, add_special_tokens = True,)
    input_ids.append(encoded_text)

## looking at an example
print('text:', texts[3])
print('Tokenized text:', tokenizer.tokenize(texts[3]))
print('input IDs:', input_ids[3])

Token indices sequence length is longer than the specified maximum sequence length for this model (710 > 512). Running this sequence through the model will result in indexing errors


text: webvrgis based city bigdata 3d visualization and analysis This paper shows the WEBVRGIS platform overlying multiple types of data about Shenzhen over a 3d globe The amount of information that can be visualized with this platform is overwhelming and the GIS based navigational scheme allows to have great flexibility to access the different available data sources For example visualising historical and forecasted passenger volume at stations could be very helpful when overlaid with other social data 
Tokenized text: ['web', 'vr', 'g', 'is', 'Ġbased', 'Ġcity', 'Ġbig', 'data', 'Ġ3', 'd', 'Ġvisualization', 'Ġand', 'Ġanalysis', 'ĠThis', 'Ġpaper', 'Ġshows', 'Ġthe', 'ĠWE', 'B', 'VR', 'G', 'IS', 'Ġplatform', 'Ġover', 'lying', 'Ġmultiple', 'Ġtypes', 'Ġof', 'Ġdata', 'Ġabout', 'ĠShen', 'zhen', 'Ġover', 'Ġa', 'Ġ3', 'd', 'Ġglobe', 'ĠThe', 'Ġamount', 'Ġof', 'Ġinformation', 'Ġthat', 'Ġcan', 'Ġbe', 'Ġvisual', 'ized', 'Ġwith', 'Ġthis', 'Ġplatform', 'Ġis', 'Ġoverwhelming', 'Ġand', 'Ġthe', 'ĠG', 'IS',

The texts in our dataset have different lengths, so they must be truncated or padded to a fixed length before they can be fed into the model. The histogram below shows the distribution of text lengths (in whitespace-separated words) in TrainData.
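
A numeric complement to the histogram is the share of texts that a given cap would truncate. The sketch below assumes a plain list of lengths. The function name `share_longer_than` and the example lengths are hypothetical, not taken from TrainData. For an exact figure, token counts such as `[len(ids) for ids in input_ids]` (before padding) would be a better input than word counts, since one word often maps to several tokens.

```python
def share_longer_than(lengths, cap):
    """Return the fraction of sequences whose length exceeds cap."""
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n > cap) / len(lengths)


# Hypothetical lengths, for illustration only.
example_lengths = [120, 250, 310, 95, 480, 200, 710, 150]
for cap in (256, 300, 512):
    print(cap, share_longer_than(example_lengths, cap))
```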

In [None]:
## Get length of all texts in TrainData

seq_len = [len(i.split()) for i in df['text']]
pd.Series(seq_len).hist(bins = 20)

The maximum sequence length that the RoBERTa model accepts is 512 tokens, so longer inputs must be truncated. According to the histogram of text lengths in TrainData, most texts are shorter than 400 words. Because of CUDA memory limitations, Max_Len is set to 300 tokens, which means that some information is lost for longer texts. Texts shorter than this value are padded at the end with a padding id, so that all inputs have the same length.
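
The padding and truncation logic can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation. It assumes RoBERTa's `<pad>` id of 1 as the fill value, whereas the notebook pads with 0 through `pad_sequences`. The names `pad_or_truncate` and `attention_mask` are hypothetical.

```python
PAD_ID = 1  # RoBERTa's <pad> id (assumption of this sketch; the notebook pads with 0)


def pad_or_truncate(ids, max_len, pad_id=PAD_ID):
    """Cut ids to max_len, or right-pad with pad_id up to max_len."""
    ids = list(ids[:max_len])
    return ids + [pad_id] * (max_len - len(ids))


def attention_mask(ids, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding positions."""
    return [int(tok != pad_id) for tok in ids]


short_seq = pad_or_truncate([0, 4, 5, 2], max_len=6)
long_seq = pad_or_truncate([0, 4, 5, 6, 7, 8, 9, 2], max_len=6)
print(short_seq, attention_mask(short_seq))  # [0, 4, 5, 2, 1, 1] [1, 1, 1, 1, 0, 0]
print(long_seq, attention_mask(long_seq))    # [0, 4, 5, 6, 7, 8] [1, 1, 1, 1, 1, 1]
```

Truncating at the end drops the closing `</s>` token of long inputs, as the second example shows. Testing against the pad id rather than `> 0` keeps the leading `<s>` token (id 0) visible in the mask.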

In [None]:
## padding/truncating
from keras.preprocessing.sequence import pad_sequences
print('\nPadding/truncating all texts to %d values...' % Max_Len)
## "post" means that padding and truncation are applied at the end of each sequence, not the beginning.
## Note: value=0 pads with id 0, which is also the id of RoBERTa's <s> token (its <pad> token has id 1).
input_ids = pad_sequences(input_ids, maxlen=Max_Len, dtype="long", value=0, truncating="post", padding="post")


Padding/truncating all texts to 300 values...


In [None]:
## Create attention masks

## The "attention mask" is an array of 1s and 0s: 1 marks a real token and 0 marks padding.
## Note: because the test is token_id > 0, the leading <s> token (id 0 in RoBERTa) is masked out as well.
attention_masks = []
for sent in input_ids:
    att_mask = [int(token_id > 0) for token_id in sent]
    attention_masks.append(att_mask)

In [None]:
## splitting: Use 80% for training and 20% for validation
from sklearn.model_selection import train_test_split
train_inputs, valid_inputs, train_labels, valid_labels = train_test_split(input_ids, labels,random_state=seed_val, test_size=0.2)
train_masks, valid_masks, _, _ = train_test_split(attention_masks, labels, random_state=seed_val, test_size=0.2)
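
The two calls above stay aligned only because they share the same `random_state` and `test_size`. `train_test_split` also accepts more than two arrays in a single call, which removes that dependency. The underlying idea, shuffling one list of indices and reusing it for every array, can be sketched with the standard library alone. The name `split_indices` and the toy arrays are hypothetical.

```python
import random


def split_indices(n, valid_fraction, seed):
    """Shuffle range(n) once and cut it into train and validation indices."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_valid = int(round(n * valid_fraction))
    return idx[n_valid:], idx[:n_valid]


# Toy stand-ins for input_ids, attention_masks and labels.
ids = [[i, i + 1] for i in range(10)]
masks = [[1, 1] for _ in range(10)]
labels = list(range(10))

train_idx, valid_idx = split_indices(n=10, valid_fraction=0.2, seed=67)

# The same index lists select matching rows from every array.
train_ids = [ids[i] for i in train_idx]
train_masks = [masks[i] for i in train_idx]
train_labels = [labels[i] for i in train_idx]
print(len(train_ids), len(valid_idx))  # 8 2
```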

In [None]:
## Convert the ndarrays into torch tensors, i.e. the format that is acceptable to the model
train_inputs = torch.tensor(train_inputs)
valid_inputs = torch.tensor(valid_inputs)
train_labels = torch.tensor(train_labels)
valid_labels = torch.tensor(valid_labels)
train_masks = torch.tensor(train_masks)
valid_masks = torch.tensor(valid_masks)

In [None]:
## Create the DataLoaders
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

valid_data = TensorDataset(valid_inputs, valid_masks, valid_labels)
valid_sampler = SequentialSampler(valid_data)
valid_dataloader = DataLoader(valid_data, sampler=valid_sampler, batch_size=batch_size)

In [None]:
## Load RobertaForSequenceClassification
# This is the RoBERTa model with a classification head on top (a dense layer with tanh, followed by an output projection to num_labels)

import random
import gc
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
gc.collect()
torch.cuda.empty_cache()

from transformers import RobertaForSequenceClassification, RobertaConfig
model = RobertaForSequenceClassification.from_pretrained("roberta-large",num_labels = 20, output_attentions = False, output_hidden_states = False,)
model.cuda()
output.clear()

In [None]:
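## optimizer
## transformers.AdamW was deprecated and has been removed from recent releases, so torch.optim.AdamW is used here.
## weight_decay=0.0 matches the default of the former transformers implementation.
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
optimizer = AdamW(model.parameters(), lr = lr, eps = eps, weight_decay = 0.0)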
## Total number of training steps is number of batches * number of epochs.
total_steps = len(train_dataloader) * epochs
## Create the learning rate scheduler
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 0, num_training_steps = total_steps)
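
To make the effect of the scheduler concrete, the sketch below reproduces the shape of a linear warmup followed by a linear decay to zero in plain Python. It illustrates the formula and is not the library code. The function name `linear_lr` is hypothetical. The figure of 4,800 batches per epoch is taken from the training log further down.

```python
def linear_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 4800 * 4  # batches per epoch * epochs
for step in (0, 4800, 9600, 19200):
    print(step, linear_lr(step, base_lr=1e-5, total_steps=total_steps))
```

With `num_warmup_steps = 0`, the learning rate starts at 1e-5, has dropped to about half of that after two epochs, and reaches 0 at the last step.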

In [None]:
## Function to calculate the accuracy 
def accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

In [None]:
## Training

os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
gc.collect()
torch.cuda.empty_cache()

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

loss_values = []
for epoch_i in range(0, epochs):
    print('Epoch {:} / {:}'.format(epoch_i + 1, epochs))
    total_loss = 0
    model.train()
    for step, batch in enumerate(train_dataloader):
        if step % 100 == 0 and not step == 0:
            print('  Batch {:>5,}  of  {:>5,}  Loss: {}'.format( step, len(train_dataloader), loss))
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        model.zero_grad()        
        outputs = model(b_input_ids, 
                    token_type_ids=None, 
                    attention_mask=b_input_mask, 
                    labels=b_labels)
        loss = outputs[0]
        total_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # This could prevent the "exploding gradients" problem
        optimizer.step()
        scheduler.step()  # Updates the learning rate
    avg_train_loss = total_loss / len(train_dataloader)            
    loss_values.append(avg_train_loss)
    print("  Average training loss: {0:.2f}".format(avg_train_loss))

    ## Evaluation on the validation set (runs at the end of every epoch)
    print("Running Validation...")
    model.eval()
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    for batch in valid_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        with torch.no_grad():        
            outputs = model(b_input_ids, 
                            token_type_ids=None, 
                            attention_mask=b_input_mask)
        logits = outputs[0]
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        tmp_eval_accuracy = accuracy(logits, label_ids)
        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1
    print(" Validation Accuracy: {0:.4f}".format(eval_accuracy/nb_eval_steps))


Epoch 1 / 4
  Batch   100  of  4,800  Loss: 1.7694305181503296
  Batch   200  of  4,800  Loss: 1.7775938510894775
  Batch   300  of  4,800  Loss: 0.9623749852180481
  Batch   400  of  4,800  Loss: 0.8060488700866699
  Batch   500  of  4,800  Loss: 2.200165271759033
  Batch   600  of  4,800  Loss: 0.6863747835159302
  Batch   700  of  4,800  Loss: 0.694191575050354
  Batch   800  of  4,800  Loss: 0.7887923717498779
  Batch   900  of  4,800  Loss: 0.3921346068382263
  Batch 1,000  of  4,800  Loss: 1.0035321712493896
  Batch 1,100  of  4,800  Loss: 0.6936561465263367
  Batch 1,200  of  4,800  Loss: 0.33959072828292847
  Batch 1,300  of  4,800  Loss: 0.8235193490982056
  Batch 1,400  of  4,800  Loss: 0.22305309772491455
  Batch 1,500  of  4,800  Loss: 0.13993532955646515
  Batch 1,600  of  4,800  Loss: 0.2875683903694153
  Batch 1,700  of  4,800  Loss: 0.3189179003238678
  Batch 1,800  of  4,800  Loss: 0.13405248522758484
  Batch 1,900  of  4,800  Loss: 0.45669588446617126
  Batch 2,000  o

## Saving/Loading the fine-tuned model

In [None]:
## Create the output directory 
output_dir = './model_save_Bahareh/'
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

model_to_save = model.module if hasattr(model, 'module') else model  # Takes care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

## Save the fine-tuned model on Google Drive
!cp -r ./model_save_Bahareh/ "/content/drive/MyDrive/"


In [None]:
## Load the fine-tuned model if needed
model = RobertaForSequenceClassification.from_pretrained(output_dir)
tokenizer = RobertaTokenizer.from_pretrained(output_dir)
model.to(device)
output.clear()

## Prediction

In [None]:
## TestData preparation

texts2 = df2.text.values

input_ids2 = []
for text in texts2:
    encoded_text2 = tokenizer.encode(text, add_special_tokens = True,)
    input_ids2.append(encoded_text2)
input_ids2 = pad_sequences(input_ids2, maxlen=Max_Len, 
                          dtype="long", truncating="post", padding="post")
attention_masks2 = []
for seq in input_ids2:
  seq_mask2 = [float(i>0) for i in seq]
  attention_masks2.append(seq_mask2) 

prediction_inputs = torch.tensor(input_ids2)
prediction_masks = torch.tensor(attention_masks2)


prediction_data = TensorDataset(prediction_inputs, prediction_masks)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)

In [None]:
## Prediction on TestData

model.eval()

predictions = []

for batch in prediction_dataloader:
  batch = tuple(t.to(device) for t in batch)
  
  b_input_ids, b_input_mask = batch

  with torch.no_grad():
    outputs = model(b_input_ids, token_type_ids=None, 
                      attention_mask=b_input_mask)
  logits = outputs[0]

  logits = logits.detach().cpu().numpy()
  
  predictions.append(logits)

In [None]:
flat_predictions = [item for sublist in predictions for item in sublist]
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()
flat_predictions

array([16,  4, 10, ..., 10, 16,  1])

In [None]:
flat_predictions = pd.DataFrame(flat_predictions, columns=['label_predicted'])
flat_predictions.to_csv('predictions_Bahareh.csv')

from google.colab import files
files.download('predictions_Bahareh.csv')

