<a href="https://colab.research.google.com/github/Brotherswords/MeSH-Deep-Learning-Multi-Label-Classification-Model/blob/main/CSCI_4931_Quiz_5_Vivekanandasarma.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quiz 5 (Take Home) Assignment

# Overview

This Python notebook documents my work building and evaluating models that assign MeSH terms to PubMed medical articles, given each article's title and abstract. I explore two approaches.

1. An LSTM network whose embedding layer is initialized from an embedding matrix built with BioWordVec, a word2vec-style model specialized for biomedical text.

2. BERT, which is not specialized for medical tasks but was trained on far more data, giving it the ability to capture deeper nuances than my simple network.

For both approaches, I was severely limited by the compute available to me, which (as I note later in my report) was likely a significant factor in the results. I tried to train for more epochs and with more complex models, but the computational limits were unforgiving.

## Installing Necessary Libraries


In [None]:
!pip install tensorflow
!pip install transformers
!pip install keras
!pip install nltk



In [19]:
from google.colab import drive
from transformers import TFBertModel, BertTokenizer
import os
import zipfile
import pickle
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional, Input
from tensorflow.keras.models import Model, Sequential, load_model
from tensorflow.keras.metrics import Precision, Recall
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [None]:
#Check for the file I need
drive.mount('/content/drive')
directory_path = '/content/drive/My Drive/Projects Things Useful/University/CU Denver 2023-2024 Sem 1/Deep Learning/Quiz_5_Data'
pickle_path = directory_path + "/Pickle_Files"
# Check if the directory exists
if os.path.exists(directory_path):
    # List all files and directories in the specified path
    files = os.listdir(directory_path)
    print("Files and directories in '", directory_path, "' :")
    for i in files:
      print(i)
else:
    print("The directory does not exist")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Files and directories in ' /content/drive/My Drive/Projects Things Useful/University/CU Denver 2023-2024 Sem 1/Deep Learning/Quiz_5_Data ' :
training-set-100000.json
test-set-20000-rev2.json
judge-set-10000-unannotated.json
Pickle_Files
BioWordVec_PubMed_MIMICIII_d200.vec.bin
predicted_labels.json
bert-models


In [None]:
import json

# No leading slash here: the f-string below already inserts the separator
file_name = "training-set-100000.json"
training_path = f'{directory_path}/{file_name}'
with open(training_path, 'r') as file:
    data = json.load(file)
if data:
    print("Training Data Loaded")

Training Data Loaded


## Step 1. Data Pre-Processing

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download stopwords and wordnet data
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Load English stopwords
stop_words = set(stopwords.words('english'))

def clean_text(text):
    # Remove HTML tags using regex/Basic cleaning steps just to make sure that the data is clean
    text = re.sub(r'<.*?>', '', text)

    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Lowercase all texts
    text = text.lower()

    # Tokenize text
    words = text.split()

    # Remove stopwords and apply stemming or lemmatization
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    # words = [stemmer.stem(word) for word in words if word not in stop_words]

    # Join the words back into a single string
    text = ' '.join(words)

    return text

# cleaned_texts = [clean_text(text) for text in texts]

# # Now `cleaned_texts` contains the preprocessed article texts


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
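
To get a feel for what the regex steps in `clean_text` do to biomedical text, here is a minimal sketch that leaves out NLTK. `clean_text_demo`, `DEMO_STOPWORDS`, and the sample sentence are assumptions made for illustration; the notebook itself uses NLTK's full English stopword list plus lemmatization.

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's full English list.
DEMO_STOPWORDS = {"the", "of", "in", "a", "and", "was"}

def clean_text_demo(text):
    """Apply the same regex and lowercasing steps as clean_text, minus lemmatization."""
    text = re.sub(r'<.*?>', '', text)        # drop HTML-like tags
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # drop digits, hyphens, and punctuation
    words = text.lower().split()
    return ' '.join(w for w in words if w not in DEMO_STOPWORDS)

# Made-up sentence in the style of an abstract
sample = "<b>IL-8</b> expression was increased in 45% of the patients (p<0.05)."
print(clean_text_demo(sample))
```

Note how `IL-8` collapses to `il` and the p-value leaves behind a bare `p`, which is likely why `p` ends up among the most frequent tokens in the vocabulary.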


In [None]:
# Assuming `data` is already loaded with the articles data in the above format
# Combine title and abstract for each article and gather labels
texts = [clean_text(d['title'] + ' ' + d['abstractText']) for d in data['articles']]
labels = [d['meshMajor'] for d in data['articles']]

# Initialize the tokenizer with a fixed vocabulary size and a punctuation filter
# Highest vocabulary size I could use without running out of RAM
vocabulary_size = 15000  # keep only the 15,000 most frequent words
tokenizer = Tokenizer(num_words=vocabulary_size, lower=True, oov_token='UNK', filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n')

tokenizer.fit_on_texts(texts)

# Convert texts to sequences of integer indices
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index

# Use the 75th percentile of sequence lengths as the maximum length; longer
# sequences get truncated. Highest value I could use without running out of RAM
max_seq_length = int(np.percentile([len(seq) for seq in sequences], 75))

# Pad the sequences so that they all have the same length
data_padded = pad_sequences(sequences, maxlen=max_seq_length, padding='post')

# Initialize MultiLabelBinarizer to convert the labels to a binary matrix
mlb = MultiLabelBinarizer()
label_data = mlb.fit_transform(labels)

# This absolutely devours RAM, use sparingly.
# # Save the tokenizer, MultiLabelBinarizer, max_seq_length for later use
# with open(pickle_path + '/preprocessing_objects.pickle', 'wb') as handle:
#     pickle.dump({'tokenizer': tokenizer, 'mlb': mlb, 'max_seq_length': max_seq_length,
#                  'data_padded': data_padded, 'label_data': label_data}, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Now `data_padded` contains the padded sequences and `label_data` contains the multi-hot encoded labels
# `word_index` contains the word index, and `max_seq_length` is the length up to which sequences will be padded
print("Preprocessing and saving completed.")


Preprocessing and saving completed.
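
As a rough picture of what the multi-hot encoding step produces, the sketch below re-implements the idea in plain Python. It is not scikit-learn's implementation; `fit_classes`, `to_multi_hot`, and the toy label lists are hypothetical names and data used only for illustration.

```python
def fit_classes(label_lists):
    """Collect the sorted set of all labels seen in the training data."""
    return sorted({label for labels in label_lists for label in labels})

def to_multi_hot(label_lists, classes):
    """Encode each label list as a 0/1 row with one column per class."""
    index = {label: i for i, label in enumerate(classes)}
    rows = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            if label in index:  # labels not seen during fitting are skipped
                row[index[label]] = 1
        rows.append(row)
    return rows

# Toy label lists (assumed for illustration)
toy_labels = [["Humans", "Adult"], ["Humans", "Mice"]]
classes = fit_classes(toy_labels)
encoded = to_multi_hot(toy_labels, classes)
print(classes)
print(encoded)
```

With 22,373 distinct MeSH terms, each row of the real `label_data` is mostly zeros, which is part of why the matrix is so memory hungry.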


## Prepping Data for ~~Roberta~~ BERT (Google Colab keeps crashing if I make any of my models too complex 😢) Model

In [None]:
from transformers import BertTokenizer
import numpy as np

tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')
max_seq_length = int(np.percentile([len(seq) for seq in sequences], 75))





In [None]:
# Tokenize the texts and create attention masks
tokenizer_bert = BertTokenizer.from_pretrained('bert-base-uncased')

input_ids = []
attention_masks = []

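# NOTE: this overwrites max_seq_length (previously the 75th percentile) with the
# 70th percentile, so any cell that runs afterwards sees the shorter length
max_seq_length = int(np.percentile([len(seq) for seq in sequences], 70))

for text in texts:
    encoded_dict = tokenizer_bert.encode_plus(
                        text,
                        add_special_tokens = True,  # Add '[CLS]' and '[SEP]'
                        max_length = max_seq_length,
                        padding = 'max_length',  # replaces the deprecated pad_to_max_length=True
                        return_attention_mask = True,
                        return_tensors = 'np',  # Return numpy arrays
                        truncation=True
                   )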

    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])

input_ids = np.concatenate(input_ids, axis=0)
attention_masks = np.concatenate(attention_masks, axis=0)

# The BERT model takes different inputs than a standard LSTM-based approach: token IDs plus attention masks
X_Train_BERT = [input_ids, attention_masks]
Y_Train_BERT = label_data
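
The relationship between `input_ids` and `attention_masks` can be sketched in plain Python. This is a simplified illustration, not the Hugging Face implementation: `pad_with_mask` is a hypothetical helper, and the middle token IDs are made up (101, 102, and 0 are the `[CLS]`, `[SEP]`, and `[PAD]` IDs in `bert-base-uncased`).

```python
def pad_with_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token IDs and build the matching attention mask."""
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)              # 1 = real token the model should attend to
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 = padding, ignored

# Short sequence: padded up to the fixed length
short_ids, short_mask = pad_with_mask([101, 11, 12, 102], max_length=6)
print(short_ids, short_mask)

# Long sequence: cut down to the fixed length
long_ids, long_mask = pad_with_mask([101, 11, 12, 13, 14, 102], max_length=4)
print(long_ids, long_mask)
```

One difference worth knowing: this sketch simply cuts the tail, whereas the real tokenizer truncates the text tokens and still ends the sequence with `[SEP]`.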



In [None]:
# # This proved to be too memory intensive for Google Colab
# # Load all preprocessing objects at once
# with open(pickle_path + '/preprocessing_objects.pickle', 'rb') as handle:
#     preprocessing_objects = pickle.load(handle)

# tokenizer = preprocessing_objects['tokenizer']
# mlb = preprocessing_objects['mlb']
# max_seq_length = preprocessing_objects['max_seq_length']
# # data_padded = preprocessing_objects['data_padded']
# label_data = preprocessing_objects['label_data']
# print("Loaded")

Loaded


In [None]:
n = 10
for i, (word, index) in enumerate(tokenizer.word_index.items()):
    print(f"{i + 1}. {word}: {index}")
    if i >= n - 1:  # since index starts at 0, we use n - 1
        break

1. UNK: 1
2. patient: 2
3. cell: 3
4. study: 4
5. result: 5
6. protein: 6
7. p: 7
8. group: 8
9. effect: 9
10. level: 10


## Step 2. Creating the Embedded Layer 🐪

### RUN THIS ONCE, RELOAD AND THEN AFTERWARDS JUST USE PICKLE FILE VERSION
(IT TAKES FOREVER TO LOAD)

In [None]:
from gensim.models import KeyedVectors

embedding_path = directory_path + "/BioWordVec_PubMed_MIMICIII_d200.vec.bin"
# Here we use BioWordVec rather than standard word2vec since it's meant for more medical-centric text
embeddings = KeyedVectors.load_word2vec_format(embedding_path, binary=True)



# Get the word_index from the tokenizer
word_index = tokenizer.word_index

# Initialize the embedding matrix
embedding_matrix = np.zeros((len(word_index) + 1, 200))  # Add 1 for padding token

# Populate the embedding matrix
for word, i in word_index.items():
    # Check if the word is in the model
    if word in embeddings.key_to_index:
        # Get the embedding vector for the word
        embedding_vector = embeddings[word]
        # If an embedding was found, add it to the matrix
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

with open(pickle_path + "/embedding_matrix_15000", 'wb') as handle:
    pickle.dump(embedding_matrix, handle, protocol=pickle.HIGHEST_PROTOCOL)

print("Embedding matrix saved.")


Embedding matrix saved.
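
It can be useful to know how many vocabulary words actually receive a pretrained vector, since every miss stays an all-zero row. The sketch below shows the same lookup loop with a hit counter, using plain lists and a dict in place of the real `KeyedVectors` object. `build_embedding_matrix` and the toy data are assumptions for illustration only.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Return (matrix, found): one row per word index, zeros where no vector exists."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]  # row 0 = padding
    found = 0
    for word, i in word_index.items():
        vector = vectors.get(word)
        if vector is not None:
            matrix[i] = list(vector)
            found += 1
    return matrix, found

# Toy vocabulary and 2-dimensional "embeddings" (made up)
toy_word_index = {"UNK": 1, "patient": 2, "cell": 3}
toy_vectors = {"patient": [0.1, 0.2], "cell": [0.3, 0.4]}

matrix, found = build_embedding_matrix(toy_word_index, toy_vectors, dim=2)
print(f"{found}/{len(toy_word_index)} words have a pretrained vector")
```

A low hit rate on the real vocabulary would be one possible sign that the cleaning step is mangling biomedical terms before lookup.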


In [None]:
# Create the embedding layer with the embedding matrix

loaded_embedding_matrix = []
with open(pickle_path + "/embedding_matrix_15000", 'rb') as handle:
    loaded_embedding_matrix = pickle.load(handle)

embedding_layer = Embedding(input_dim=len(word_index) + 1,
                            output_dim=200,
                            weights=[loaded_embedding_matrix],
                            input_length=max_seq_length,  # As determined earlier
                            trainable=False)

## Step 3. Create LSTM layers

In [None]:
# Number of unique labels in MeSH classification
num_labels = len(mlb.classes_)

model = Sequential()
# Add the pre-loaded embedding layer
model.add(embedding_layer)
# Add an LSTM layer
model.add(Bidirectional(LSTM(units=128, return_sequences=True)))
model.add(Dropout(0.1))
model.add(Bidirectional(LSTM(units=64)))
# model.add(Dropout(0.2))
# Add a dense layer
model.add(Dense(units=128, activation='relu'))
# model.add(Dropout(0.2))
# Output layer with a sigmoid activation for multi-label classification
# Final layer for each label
model.add(Dense(units=num_labels, activation='sigmoid'))
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 155, 200)          54306400  
                                                                 
 bidirectional (Bidirectiona  (None, 155, 256)         336896    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 155, 256)          0         
                                                                 
 bidirectional_1 (Bidirectio  (None, 128)              164352    
 nal)                                                            
                                                                 
 dense (Dense)               (None, 128)               16512     
                                                                 
 dense_1 (Dense)             (None, 22373)             2
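
The parameter counts in the summary above can be checked by hand with the standard formulas for LSTM and dense layers. The helper names below are hypothetical; only the layers whose counts appear in full in the summary are checked.

```python
def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)

def bidirectional_lstm_params(input_dim, units):
    """A bidirectional wrapper holds two independent LSTMs."""
    return 2 * lstm_params(input_dim, units)

def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * units + units

first_bilstm = bidirectional_lstm_params(200, 128)   # input: 200-d embeddings
second_bilstm = bidirectional_lstm_params(256, 64)   # input: 2 * 128 from the first layer
hidden_dense = dense_params(128, 128)                # input: 2 * 64 from the second layer
embedding_rows = 54306400 // 200                     # embedding params / embedding dim

print(first_bilstm, second_bilstm, hidden_dense, embedding_rows)
```

The embedding layer works out to 271,532 rows, which suggests the matrix covers the full `word_index` rather than only the 15,000-word cap passed to the tokenizer.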

Let's try building a second, different model because, not gonna lie, the first one was pretty bad lmfao, at least on Recall(). Kind of a skill issue tbh.

In [None]:
from transformers import TFBertModel, BertTokenizer
from tensorflow.keras.layers import Input, Dense, Dropout, BatchNormalization
from tensorflow.keras.models import Model
from tensorflow.keras.metrics import Precision, Recall
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l1_l2

num_labels = len(mlb.classes_)

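# Load BERT model and tokenizer
# Separate names keep the Keras `tokenizer` (needed later for the LSTM test data)
# from being overwritten, and keep the checkpoint name apart from the Keras model below
bert_model_name = 'bert-base-uncased'
tokenizer_bert = BertTokenizer.from_pretrained(bert_model_name)
bert = TFBertModel.from_pretrained(bert_model_name)

# Freeze BERT layers so Colab can actually run this; I imagine this would be a lot better if I could leave them trainable
for layer in bert.layers:
    layer.trainable = False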

# Model inputs
input_ids = Input(shape=(max_seq_length,), dtype='int32', name='input_ids')
attention_mask = Input(shape=(max_seq_length,), dtype='int32', name='attention_mask')

# BERT pooled output: index 1 of the model output is the pooled [CLS] representation
embeddings = bert(input_ids, attention_mask=attention_mask)[1]

# Additional layers
x = Dense(128, activation='relu', kernel_regularizer=l1_l2(l1=1e-5, l2=1e-4))(embeddings)
x = BatchNormalization()(x)
x = Dropout(0.1)(x)
output = Dense(num_labels, activation='sigmoid')(x)

# Define the model
bert_model = Model(inputs=[input_ids, attention_mask], outputs=output)

# Compile the model with a fixed learning rate (no scheduler is used)
optimizer = Adam(learning_rate=0.001)
bert_model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=[Precision(), Recall()])

bert_model.summary()


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 148)]        0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 148)]        0           []                               
                                                                                                  
 tf_bert_model (TFBertModel)    TFBaseModelOutputWi  109482240   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]']         
                                tentions(last_hidde                                               
                                n_state=(None, 148,                                           

## Step 4. Train the model and pray that Google Colab Pro is Enough!

In [None]:
# Load the training data
X_train = data_padded  # padded sequence data
Y_train = label_data   # multi-hot encoded MeSH terms

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.metrics import Recall

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[Recall()])


# Set training parameters
batch_size = 32
epochs = 25
validation_split = 0.1  # Percentage of data to use as validation

# Define an EarlyStopping callback to prevent overfitting
early_stopping = EarlyStopping(monitor='val_loss', patience=3, verbose=1)

# Train the model
history = model.fit(X_train, Y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_split=validation_split,
                    callbacks=[early_stopping],
                    verbose=1)

print("Training completed.")

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 23: early stopping
Training completed.
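
To make the `patience=3` setting concrete, here is a small sketch of the stopping rule in plain Python. It is a simplified model of the callback's behaviour (no `min_delta`, no weight restoring), and `early_stopping_epoch` plus the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: the best value occurs at epoch 3
toy_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
stop_epoch = early_stopping_epoch(toy_losses, patience=3)
print(stop_epoch)
```

Under this rule, a run that stops at epoch 23 with `patience=3` would have seen its last `val_loss` improvement at epoch 20. Note that the final weights are those of the last epoch unless `restore_best_weights=True` is set.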


## Time to train a second model to see which one is better, this one is transformer based (😞 running out of System RAM)

In [8]:
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

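# Re-compile the second model (this swaps the earlier Adam instance for a default 'adam', which uses the same 0.001 learning rate)
bert_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[Precision(), Recall()])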

# Set training parameters
batch_size = 32
epochs = 10  # Averages about 1.5 hours per epoch; any more than this makes Google Colab time out my session ):
validation_split = 0.1  # Use 10% of the training data for validation


# Define the checkpoint path and filenames
checkpoint_path = pickle_path + "/bert-models/model-{epoch:04d}.ckpt"  # Replace with your path and file naming scheme

# Create a callback that saves the model's weights every epoch
checkpoint_callback = ModelCheckpoint(
    filepath=checkpoint_path,
    verbose=1,
    save_weights_only=True,
    save_freq='epoch'
)


# Define an EarlyStopping callback to prevent overfitting
early_stopping = EarlyStopping(monitor='val_loss', patience=3, verbose=1)

# Train the model
history = bert_model.fit(X_Train_BERT, Y_Train_BERT,
                           batch_size=batch_size,
                           epochs=epochs,
                           validation_split=validation_split,  # Use validation split
                           callbacks=[early_stopping, checkpoint_callback],  # also save weights every epoch
                           verbose=1)

print("Training completed.")


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 8: early stopping
Training completed.


Oh, the Colab environment crashed because we ran out of memory when training the BERT-based model... (the biggest constraint for this project is that the computing environments I have access to are not up to the task)

In [9]:
model_save_path = pickle_path + '/saved_model_25_Epochs_Recall.h5'  # Replace with your desired path
second_model_save_path = pickle_path + '/saved_model_25_Epochs_BERT.h5'  # Replace with your desired path


In [12]:
# Save the trained model
bert_model.save(second_model_save_path)
print(f"Model saved at {second_model_save_path}")


Model saved at /content/drive/My Drive/Projects Things Useful/University/CU Denver 2023-2024 Sem 1/Deep Learning/Quiz_5_Data/Pickle_Files/saved_model_25_Epochs_BERT.h5


In [None]:
# Save the model
# normal version of model
# model_save_path = pickle_path + '/saved_model.h5'  # Replace with your desired path
model.save(model_save_path)
print(f"Model saved at {model_save_path}")


Model saved at /content/drive/My Drive/Projects Things Useful/University/CU Denver 2023-2024 Sem 1/Deep Learning/Quiz_5_Data/Pickle_Files/saved_model_25_Epochs_Recall.h5


## Step 5. Load the saved model and give it a test run to check we're getting real results.

In [None]:
from tensorflow.keras.models import load_model

# Load the model
loaded_model = load_model(model_save_path)
print("Model loaded successfully.")

Model loaded successfully.


In [14]:
from transformers import TFBertModel

# Specify the custom object (TFBertModel in this case)
custom_objects = {"TFBertModel": TFBertModel}

# Load the model
loaded_BERT_based_model = load_model(second_model_save_path, custom_objects=custom_objects)
print("Model loaded successfully.")




Model loaded successfully.


In [None]:
first_sample = X_train[1:2]  # Select the sample at index 1 (the second sample), kept as a 2D batch
predicted_result = loaded_model.predict(first_sample)  # Predict using the loaded model

# Print the predicted result
print("Predicted result:", predicted_result)

# Print the actual result
actual_result = Y_train[1]
print("Actual result:", actual_result)



Predicted result: [[6.98327653e-07 5.76985667e-06 2.49171399e-06 ... 1.08561708e-05
  1.24572925e-05 1.07505741e-06]]
Actual result: [0 0 0 ... 0 0 0]


In [None]:
# Use inverse_transform to convert binary vectors back to labels
import numpy as np

# Convert predicted result to binary and then to labels
predicted_labels_conformed = mlb.inverse_transform(np.round(predicted_result))

# Ensure actual_result is a 2D array for inverse_transform
actual_labels_conformed = mlb.inverse_transform(np.array([actual_result]))

print("Predicted MeSH terms:", predicted_labels_conformed[0])
print("Actual MeSH terms:", actual_labels_conformed[0])



Predicted MeSH terms: ('Adult', 'Aged', 'Carcinoma, Squamous Cell', 'Case-Control Studies', 'Female', 'Genetic Predisposition to Disease', 'Genotype', 'Humans', 'Male', 'Middle Aged', 'Polymorphism, Genetic', 'Uterine Cervical Neoplasms')
Actual MeSH terms: ('Adult', 'Aged', 'Aged, 80 and over', 'Carcinoma, Squamous Cell', 'Case-Control Studies', 'Female', 'Gene Frequency', 'Genetic Predisposition to Disease', 'Genotype', 'Germany', 'Greece', 'Humans', 'Interleukin-8', 'Male', 'Middle Aged', 'Mouth Neoplasms', 'Polymorphism, Genetic', 'Reverse Transcriptase Polymerase Chain Reaction', 'Risk')
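
The two term lists printed above can be compared directly with set operations to get a per-sample precision and recall. The lists are copied from that output; the variable names are hypothetical.

```python
predicted = {
    'Adult', 'Aged', 'Carcinoma, Squamous Cell', 'Case-Control Studies', 'Female',
    'Genetic Predisposition to Disease', 'Genotype', 'Humans', 'Male', 'Middle Aged',
    'Polymorphism, Genetic', 'Uterine Cervical Neoplasms',
}
actual = {
    'Adult', 'Aged', 'Aged, 80 and over', 'Carcinoma, Squamous Cell',
    'Case-Control Studies', 'Female', 'Gene Frequency',
    'Genetic Predisposition to Disease', 'Genotype', 'Germany', 'Greece', 'Humans',
    'Interleukin-8', 'Male', 'Middle Aged', 'Mouth Neoplasms', 'Polymorphism, Genetic',
    'Reverse Transcriptase Polymerase Chain Reaction', 'Risk',
}

correct = predicted & actual   # terms the model got right
wrong = predicted - actual     # predicted but not annotated
missed = actual - predicted    # annotated but not predicted

sample_precision = len(correct) / len(predicted)
sample_recall = len(correct) / len(actual)

print("Wrong:", sorted(wrong))
print("Missed:", sorted(missed))
print(f"Precision: {sample_precision:.3f}, Recall: {sample_recall:.3f}")
```

For this one sample the pattern is high precision with much lower recall: the frequent, generic terms are found while the specific ones (for example `Interleukin-8` and `Mouth Neoplasms`) are missed.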


## Seeing how well we did on the test set 🙏

In [None]:
# Load the test set
import json
file_name = "test-set-20000-rev2.json"
test_path = f'{directory_path}/{file_name}'
with open(test_path, 'r') as file:
    data = json.load(file)
if data:
    print("Test Data Loaded")

texts = [clean_text(d['title'] + ' ' + d['abstractText']) for d in data['documents']]
labels = [d['meshMajor'] for d in data['documents']]
sequences = tokenizer.texts_to_sequences(texts)

# Pad the sequences so that they all have the same length
X_Test = data_padded_test = pad_sequences(sequences, maxlen=max_seq_length, padding='post')
Y_Test = label_data_test = mlb.transform(labels)

Test Data Loaded


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np

# Make predictions on the test set
test_predictions = loaded_model.predict(X_Test)
binary_predictions = np.round(test_predictions)



In [None]:
print(len(Y_Test[0]))
print(len(X_Test[0]))

print(Y_Test[0])
print(X_Test[0])

22373
155
[0 0 0 ... 0 0 0]
[2058 3561 3506 1692 3315   16    1 1301   11 1271  111 2058 3561 3506
 1692 3315  155 1188  416  983   16    1 1301  103 2863  607   42 1301
   44   62   11 3506 1247  121 2263   61    1  212 2058 3561 1692 3315
  416   81 1301   33  761  416 2263  212 1353  643 2263  106   86  985
  212   62 1188   81 1301  510 1365  111   20    5 1423 1379   10 1301
   20  766 1081  943 1301    6   62  416 2263    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0]


## Metric Evaluations: LSTM Model with Embedding Layer

In [None]:
# Calculate the metrics
accuracy = accuracy_score(Y_Test, binary_predictions)
micro_precision = precision_score(Y_Test, binary_predictions, average='micro')
micro_recall = recall_score(Y_Test, binary_predictions, average='micro')
micro_f1 = f1_score(Y_Test, binary_predictions, average='micro')

In [None]:
print("Accuracy on Test Data:", accuracy)
print("Micro Precision on Test Data:", micro_precision)
print("Micro Recall on Test Data:", micro_recall)
print("Micro F1 Score on Test Data:", micro_f1)

Accuracy on Test Data: 0.0
Micro Precision on Test Data: 0.6947941675938397
Micro Recall on Test Data: 0.2274112492381292
Micro F1 Score on Test Data: 0.3426655422577515
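
Micro-averaging pools every (sample, label) cell into one set of counts before computing the scores. The sketch below shows that idea in plain Python on toy matrices; it is an illustration of the definition, not scikit-learn's code, and `micro_scores` and the toy data are assumptions.

```python
def micro_scores(y_true, y_pred):
    """Pool true/false positives and false negatives over all samples and labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy multi-hot matrices: 2 samples, 4 labels (made up)
toy_true = [[1, 1, 0, 1], [0, 1, 1, 0]]
toy_pred = [[1, 0, 0, 0], [0, 1, 0, 1]]

toy_precision, toy_recall, toy_f1 = micro_scores(toy_true, toy_pred)
print(toy_precision, toy_recall, toy_f1)
```

Because the counts are pooled, frequent labels such as `Humans` dominate the micro scores, so good micro precision does not by itself say much about rare terms.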


## Metric Evaluations: BERT-Based Transformer Model

In [16]:
# BERT transformer-based model: reload the test set
import json
file_name = "test-set-20000-rev2.json"
test_path = f'{directory_path}/{file_name}'
with open(test_path, 'r') as file:
    data = json.load(file)
if data:
    print("Test Data Loaded")

texts = [clean_text(d['title'] + ' ' + d['abstractText']) for d in data['documents']]
labels = [d['meshMajor'] for d in data['documents']]

input_ids = []
attention_masks = []

for text in texts:
    encoded_dict = tokenizer_bert.encode_plus(
                        text,
                        add_special_tokens = True,  # Add '[CLS]' and '[SEP]'
                        max_length = max_seq_length,
                        padding = 'max_length',  # replaces the deprecated pad_to_max_length=True
                        return_attention_mask = True,
                        return_tensors = 'np',  # Return numpy arrays
                        truncation=True
                   )

    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])

input_ids = np.concatenate(input_ids, axis=0)
attention_masks = np.concatenate(attention_masks, axis=0)

# The BERT model takes different inputs than a standard LSTM-based approach: token IDs plus attention masks
X_Test_BERT = [input_ids, attention_masks]
Y_Test_BERT = mlb.transform(labels)

Test Data Loaded




In [17]:
test_predictions_BERT = loaded_BERT_based_model.predict(X_Test_BERT)
binary_predictions_BERT = np.round(test_predictions_BERT)



In [20]:
accuracy = accuracy_score(Y_Test_BERT, binary_predictions_BERT)
micro_precision = precision_score(Y_Test_BERT, binary_predictions_BERT, average='micro')
micro_recall = recall_score(Y_Test_BERT, binary_predictions_BERT, average='micro')
micro_f1 = f1_score(Y_Test_BERT, binary_predictions_BERT, average='micro')

In [22]:
print("Transformer Model Performance")
print("Accuracy on Test Data:", accuracy)
print("Micro Precision on Test Data:", micro_precision)
print("Micro Recall on Test Data:", micro_recall)
print("Micro F1 Score on Test Data:", micro_f1)

Transformer Model Performance
Accuracy on Test Data: 0.0
Micro Precision on Test Data: 0.39523151277885965
Micro Recall on Test Data: 0.15015467414929218
Micro F1 Score on Test Data: 0.2176287571531752


# Model Report

After evaluating my LSTM model on the test data for the multi-label classification task, the results suggest that it is moderately effective but leaves significant room for improvement. The accuracy of 0.0 looks alarming, but `accuracy_score` on multi-label data is subset accuracy: a sample counts as correct only when every one of its labels is predicted exactly. With 22,373 possible labels, the model never achieves a perfect match on any test sample, which is a known challenge in multi-label classification.
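
A toy example makes the exact-match behaviour easier to see. The sketch below compares subset accuracy with simple per-label agreement in plain Python; both helper names and the toy matrices are hypothetical.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)

def per_label_agreement(y_true, y_pred):
    """Fraction of individual (sample, label) cells that match."""
    cells = [(t, p) for tr, pr in zip(y_true, y_pred) for t, p in zip(tr, pr)]
    return sum(1 for t, p in cells if t == p) / len(cells)

# Each prediction misses exactly one label (made-up data)
toy_true = [[1, 1, 0, 1], [0, 1, 1, 0]]
toy_pred = [[1, 1, 0, 0], [0, 1, 0, 0]]

exact_match = subset_accuracy(toy_true, toy_pred)
cell_match = per_label_agreement(toy_true, toy_pred)
print(exact_match, cell_match)
```

Here every prediction is mostly right, yet the exact-match score is still zero, which is the same effect seen on the test set.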

The micro precision of 69.48% shows that a majority of the predicted labels are correct. However, the micro recall is only 22.74%, meaning the model misses most of the relevant labels. This is reflected in the micro F1 score of 34.27%, the harmonic mean of precision and recall.
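
As a quick consistency check, the F1 score can be recomputed from the reported precision and recall. The numbers are copied from the LSTM output above; `f1_from` is a hypothetical helper.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

lstm_precision = 0.6947941675938397
lstm_recall = 0.2274112492381292

lstm_f1 = f1_from(lstm_precision, lstm_recall)
print(round(lstm_f1, 4))
```

The harmonic mean sits much closer to the lower of the two values, so the weak recall is what holds F1 down here.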

The gap between precision and recall suggests a conservative model: it makes relatively few wrong predictions but overlooks a large number of correct labels. This could stem from several factors, such as the model's limited complexity, class imbalance in the dataset, or insufficient training. Fine-tuning and a more in-depth error analysis are the natural next steps.
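
One inexpensive thing to try, which this notebook does not do, is lowering the decision threshold below the 0.5 implied by `np.round`. The sketch below illustrates the trade-off on made-up probabilities; `labels_above` and `precision_recall` are hypothetical helpers, and a real threshold should be chosen on validation data rather than the test set.

```python
def labels_above(probabilities, threshold):
    """Mark a label as predicted when its probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up sigmoid outputs for one sample with six labels
toy_probs = [0.92, 0.61, 0.40, 0.35, 0.08, 0.33]
toy_truth = [1, 1, 1, 1, 0, 0]

strict = precision_recall(toy_truth, labels_above(toy_probs, 0.5))
relaxed = precision_recall(toy_truth, labels_above(toy_probs, 0.3))
print("threshold 0.5:", strict)
print("threshold 0.3:", relaxed)
```

In this toy case the lower threshold recovers the two missed labels at the cost of one false positive, the kind of trade that could help a model with high precision and low recall.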

Furthermore, the transformer-based approach using BERT may still be worth pursuing. However, due to computational limitations (I had to freeze BERT's weights to even get the model to train on Google Colab), I was unable to explore it beyond 8 epochs of training. As it stands, its micro precision (39.52%) and micro recall (15.02%) are far inferior to those of the LSTM model with an embedding layer.



## Step 6. Making Predictions for Judge and Writing to JSON

In [None]:
# Load the judge set
import json
file_name = "judge-set-10000-unannotated.json"
judge_path = f'{directory_path}/{file_name}'
with open(judge_path, 'r') as file:
    data = json.load(file)
if data:
    print("Judge Data Loaded")

texts = [clean_text(d['title'] + ' ' + d['abstractText']) for d in data['documents']]
labels = [d['pmid'] for d in data['documents']]
sequences = tokenizer.texts_to_sequences(texts)

# Pad the sequences so that they all have the same length
X_Judge = data_padded_test = pad_sequences(sequences, maxlen=max_seq_length, padding='post')

Judge Data Loaded


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import numpy as np

# Make predictions on the judge set
judge_predictions = loaded_model.predict(X_Judge)
binary_predictions = np.round(judge_predictions)
predicted_labels_conformed = mlb.inverse_transform((binary_predictions))



In [None]:
import json

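# Prepare data for JSON export
# `labels` holds the PMIDs of the judge documents here; the loop variable gets its
# own name so it no longer shadows the list being iterated
json_output = {"documents": []}
for pmid, predicted_terms in zip(labels, predicted_labels_conformed):
    json_output["documents"].append({"pmid": pmid, "labels": list(predicted_terms)})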

# Write the data to a JSON file
output_path = f'{directory_path}/predicted_labels.json'
with open(output_path, 'w') as file:
    json.dump(json_output, file, indent=4)

print(f"Predictions saved to {output_path}")


Predictions saved to /content/drive/My Drive/Projects Things Useful/University/CU Denver 2023-2024 Sem 1/Deep Learning/Quiz_5_Data/predicted_labels.json
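
Before submitting, it may be worth checking that the output has the expected shape and survives a JSON round trip. The sketch below is one way to do that under the assumption that the judge expects a `documents` list of `pmid`/`labels` entries, as written above; `build_submission` and the toy values are hypothetical.

```python
import json

def build_submission(pmids, predicted_label_sets):
    """Pair each PMID with its predicted labels, refusing mismatched lengths."""
    if len(pmids) != len(predicted_label_sets):
        # zip() would silently drop the extra items, so fail loudly instead
        raise ValueError("Number of PMIDs and predictions differ")
    return {
        "documents": [
            {"pmid": pmid, "labels": list(terms)}
            for pmid, terms in zip(pmids, predicted_label_sets)
        ]
    }

# Toy PMIDs and predictions (made up); the second document has no predicted labels
toy_submission = build_submission([101, 102], [("Humans", "Adult"), ()])
round_tripped = json.loads(json.dumps(toy_submission, indent=4))
print(round_tripped)
```

Documents with an empty `labels` list are worth counting on the real output, since a conservative model can leave some abstracts with no prediction at all.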
