In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer
from torch.utils.data import TensorDataset

In [None]:
df=pd.read_csv('/content/drive/My Drive/PubMed Multi Label Text Classification Dataset Processed.csv')

In [None]:
df=df.drop_duplicates()

Duplicate rows are dropped because they would add noise during training.

In [None]:
rowsums=df.iloc[:,6:].sum(axis=1)
no_label_count = 0
for row_sum in rowsums.values:
    if row_sum==0:
        no_label_count +=1
print("Total number of articles without label:", no_label_count)

Total number of articles without label: 361


Articles without any label (every label value in the row is 0) would be noise during training. To count them, the label columns (from the 7th column, position 6, onwards) are summed for each row. A loop then iterates through the row sums and increases the count by 1 whenever a sum is 0.
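The same count can also be obtained without an explicit loop. The sketch below uses a toy dataframe; the column names text, A and B are made up for illustration, and the label columns start at position 1 here rather than position 6.

```python
import pandas as pd

# Toy frame: one text column followed by two hypothetical label columns.
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "A": [1, 0, 0, 1],
    "B": [0, 0, 1, 1],
})

# Sum the label columns per row; a sum of 0 means the row has no label.
toy_rowsums = toy.iloc[:, 1:].sum(axis=1)

# Comparing to 0 gives a boolean Series; summing it counts the True values.
toy_no_label_count = int((toy_rowsums == 0).sum())
print("Rows without label:", toy_no_label_count)
```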

In [None]:
check_row=[]
for row_sum in rowsums.values:
    check_row.append(row_sum)
df['check']=check_row
df=df.drop(df[df['check']==0].index)
df=df.drop(['check'],axis=1)
print("Removed articles without label")

Removed articles without label


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['check']=check_row


To remove the articles with no label, the row sums are added to the dataframe as a temporary column named check. Rows where this column is 0 are dropped, and the column itself is then removed.
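An alternative that needs no helper column is to filter with a boolean mask. This is only a sketch on a toy dataframe with made-up column names, not the method used in this notebook. The explicit .copy() is the usual way to make the result an independent dataframe, which avoids the "copy of a slice" warning on later column assignments.

```python
import pandas as pd

toy = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "A": [1, 0, 0, 1],
    "B": [0, 0, 1, 1],
})

# True for rows with at least one label set.
has_label = toy.iloc[:, 1:].sum(axis=1) > 0

# Boolean indexing keeps the labelled rows; .copy() makes the result an
# independent frame rather than a slice of the original.
toy_clean = toy[has_label].copy()
print(toy_clean)
```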

In [None]:
cols=list(df.columns)
mesh_Heading_categories=cols[6:]
num_labels=len(mesh_Heading_categories)
print('Root Labels:',mesh_Heading_categories)
print('Number of Labels:' ,num_labels)

Root Labels: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'L', 'M', 'N', 'Z']
Number of Labels: 14


To get the label names and the number of labels, the column names are stored as a list and sliced from index 6 to the end. The slice holds the label names, and its length is the number of labels.

In [None]:
df_train, df_test = train_test_split(df, random_state=32, test_size=0.20, shuffle=True)

print(df_train.shape)
print(df_test.shape)

(39650, 20)
(9913, 20)


The data is split into train and test sets in an 80:20 ratio. It is shuffled first so that any pattern in the row order does not leak into the split and both sets represent the whole data. Fixing the random state makes the split reproducible: the same rows end up in the same set every time the cell is run on the same data.
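The effect of the random state can be checked on a small list. This is a minimal sketch with made-up data; only the train_test_split arguments mirror the cell above.

```python
from sklearn.model_selection import train_test_split

items = list(range(10))

# Two calls with the same random_state produce identical partitions.
train_a, test_a = train_test_split(items, random_state=32, test_size=0.20, shuffle=True)
train_b, test_b = train_test_split(items, random_state=32, test_size=0.20, shuffle=True)

print(len(train_a), len(test_a))
print(train_a == train_b and test_a == test_b)
```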

In [None]:
df_train['one_hot_labels'] = list(df_train[mesh_Heading_categories].values)

labels = list(df_train.one_hot_labels.values)
Article_train = list(df_train.abstractText.values)

To create a new column of one hot encoded labels in the training dataframe, the label values are extracted as an array and converted to a list of per-row vectors, which are then assigned to the column. Since this is a multi-label problem, a vector can contain more than one 1.

The articles and the label vectors of the training data are stored as lists, as they will be the independent and target variables.
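The step above can be sketched on a toy dataframe. The two abstracts and the three label columns here are invented for illustration; only the pattern of calling list() on the label values follows the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "abstractText": ["first abstract", "second abstract"],
    "A": [1, 0],
    "B": [1, 1],
    "C": [0, 1],
})
label_cols = ["A", "B", "C"]

# .values gives a 2-D array of shape (rows, labels); list() splits it into
# one 1-D array per row, which can then be stored in a single column.
toy["one_hot_labels"] = list(toy[label_cols].values)

# Convert to plain Python lists only to make the printed result easy to read.
toy_labels = [row.tolist() for row in toy["one_hot_labels"]]
print(toy_labels)
```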

In [None]:
max_length = 128
tokenizer=AutoTokenizer.from_pretrained('thomas-sounack/BioClinical-ModernBERT-base', do_lower_case=True)

encodings=tokenizer.batch_encode_plus(Article_train,max_length=max_length,padding=True,truncation=True)
print('tokenizer outputs: ', encodings.keys())

input_ids=encodings['input_ids']
attention_masks=encodings['attention_mask']

/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94:
UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.

To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.

You will be able to reuse this secret in all of your notebooks.

Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(

tokenizer outputs:  KeysView({'input_ids': [[50281, 1231, 6266, 11249, 253, 2219, 273, 767, 10652, 1363, 1119, 281, ....

Max length is a hyperparameter that sets the maximum token sequence length. I tried a few combinations of max length and batch size (the batch size is the hyperparameter used when loading the data in batches) and got out-of-memory (OOM) errors for (256, 64) and (256, 32), so (128, 64) was chosen.
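A rough way to compare these combinations is to count token positions and token pairs per batch. This is only back-of-the-envelope arithmetic, not a memory measurement: it assumes full self-attention that scores every token pair, and actual memory use depends on the model, the attention implementation and the GPU.

```python
# (max_length, batch_size) combinations mentioned above.
configs = [(256, 64), (256, 32), (128, 64)]

counts = {}
for seq_len, batch_size in configs:
    tokens = batch_size * seq_len        # token positions per batch
    pairs = batch_size * seq_len ** 2    # token pairs per batch under full attention
    counts[(seq_len, batch_size)] = (tokens, pairs)
    print(seq_len, batch_size, tokens, pairs)
```

Under that assumption, (256, 32) and (128, 64) hold the same number of tokens per batch, but the longer sequences produce twice as many token pairs, which is one possible reason the former ran out of memory while the latter did not.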

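A pretrained tokenizer is loaded so that the articles are tokenized into the input format the transformer expects. The do_lower_case=True argument is passed with the intent of converting the articles to lower case before tokenization; whether the flag takes effect depends on the tokenizer class, so the behavior is worth confirming on a sample sentence.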

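Tokenization is done on the whole list of articles in one batched call rather than one article at a time. A token sequence longer than max length is truncated to max length. With padding=True, shorter sequences are padded to the length of the longest sequence in the batch, which equals max length as soon as one article reaches it.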

Input IDs are the numerical representation of the token sequences. Attention masks mark which tokens the model should attend to (1) and which it should ignore (0, the padding tokens).
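The relationship between truncation, padding and the attention mask can be illustrated in plain Python. The helper pad_and_truncate below is hypothetical and greatly simplified: it ignores special tokens, pads on the right, and uses a made-up pad ID of 0, so it is not how the Hugging Face tokenizer is implemented.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Hypothetical helper mimicking truncation plus pad-to-longest."""
    # Truncate every sequence to at most max_length tokens.
    truncated = [seq[:max_length] for seq in sequences]
    # Pad up to the longest remaining sequence in the batch.
    target = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


toy_ids, toy_mask = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6]], max_length=4)
print(toy_ids)
print(toy_mask)
```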

In [None]:
train_inputs, validation_inputs, train_labels, validation_labels, train_masks, validation_masks = train_test_split(input_ids, labels, attention_masks, random_state=2020, test_size=0.20)
train_inputs = torch.tensor(train_inputs)
train_labels = torch.tensor(train_labels)
train_masks = torch.tensor(train_masks)

validation_inputs = torch.tensor(validation_inputs)
validation_labels = torch.tensor(validation_labels)
validation_masks = torch.tensor(validation_masks)


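Input IDs, labels (one hot encoded values) and attention masks of the training portion are split again into training and validation data in an 80:20 ratio. Passing all three lists to a single call keeps them aligned, so each input stays with its own mask and labels. All training and validation data is then converted into torch tensors, the data structure that PyTorch models and dataloaders work with.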
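Because the validation split is taken from the training portion of the earlier train/test split, the overall proportions differ from 80:20. The arithmetic below works them out from the two test_size values used in this notebook; the variable names are only for illustration.

```python
test_fraction = 0.20          # first split: held-out test set
validation_fraction = 0.20    # second split: taken from the remaining training data

remaining = 1.0 - test_fraction
overall_validation = remaining * validation_fraction
overall_train = remaining * (1.0 - validation_fraction)

print(f"train {overall_train:.2f}, validation {overall_validation:.2f}, test {test_fraction:.2f}")
```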

In [None]:
train_data=TensorDataset(train_inputs, train_masks, train_labels)

validation_data=TensorDataset(validation_inputs, validation_masks, validation_labels)

torch.save(validation_data,'/content/drive/My Drive/tensor data/validation_data.pt')
torch.save(train_data,'/content/drive/My Drive/tensor data/train_data.pt')

Training and validation tensors are wrapped in tensor datasets, where each sample is a tuple of (input IDs, attention mask, labels). Tensor datasets make the data easier to hand to the dataloaders.

The tensor datasets are saved to Google Drive for use in the training notebook.
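The tuple structure of a tensor dataset, and how a dataloader batches it, can be seen on a few toy tensors. All values and the batch size of 2 below are made up; only the field order (inputs, masks, labels) follows the notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy tensors: 4 samples, sequence length 3, 2 labels (all values made up).
toy_inputs = torch.tensor([[1, 2, 3], [4, 5, 0], [6, 0, 0], [7, 8, 9]])
toy_masks = torch.tensor([[1, 1, 1], [1, 1, 0], [1, 0, 0], [1, 1, 1]])
toy_labels = torch.tensor([[1, 0], [0, 1], [1, 1], [0, 1]])

toy_data = TensorDataset(toy_inputs, toy_masks, toy_labels)

# Indexing returns one (input_ids, attention_mask, labels) tuple per sample.
ids, mask, label = toy_data[1]
print(ids.tolist(), mask.tolist(), label.tolist())

# A DataLoader stacks those tuples into batches in the same field order.
toy_loader = DataLoader(toy_data, batch_size=2, shuffle=False)
batch_shapes = [tuple(t.shape for t in batch) for batch in toy_loader]
print(batch_shapes)
```

When a saved dataset file is read back with torch.load, recent PyTorch versions may require weights_only=False to restore a full TensorDataset object; this depends on the installed version and should be checked against its documentation.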

In [None]:
df_test['one_hot_labels'] = list(df_test[mesh_Heading_categories].values)

test_labels = list(df_test.one_hot_labels.values)
Articles_test = list(df_test.abstractText.values)

To create a new column of one hot encoded labels in the test dataframe, the label values are extracted as an array and converted to a list of per-row vectors, which are then assigned to the column.

The articles and the label vectors of the test data are stored as lists, as they will be the independent and target variables.

In [None]:
test_encodings = tokenizer.batch_encode_plus(Articles_test,max_length=max_length,padding=True,truncation=True)
test_input_ids = test_encodings['input_ids']
test_attention_masks = test_encodings['attention_mask']

The test data is tokenized with the same tokenizer and max length as the training data, and the input IDs and attention masks are extracted.

In [None]:
test_inputs = torch.tensor(test_input_ids)
test_labels = torch.tensor(test_labels)
test_masks = torch.tensor(test_attention_masks)

test_data = TensorDataset(test_inputs, test_masks, test_labels)

torch.save(test_data,'/content/drive/My Drive/tensor data/test_data.pt')

The test tensors are wrapped in a tensor dataset, where each sample is a tuple of (input IDs, attention mask, labels). The dataset is saved to Google Drive for use in the evaluation notebook.