# Milestone Project 2: SkimLit 📄🔥

The purpose of this notebook is to build an NLP model that makes reading medical abstracts easier.

The paper we're replicating (the source of the dataset we'll be using) is available here: https://arxiv.org/abs/1710.06071

Reading through that paper, we see that the model architecture they use to achieve their best results is described here: https://arxiv.org/abs/1612.05251



# Confirm access to a GPU

In [1]:
!nvidia-smi

Tue Jan 14 19:51:02 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   30C    P0              42W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

# Get data

Since we'll be replicating the paper above (PubMed 200k RCT), let's download the dataset they used.

We can do so from the authors' GitHub: https://github.com/Franck-Dernoncourt/pubmed-rct

In [2]:
!git clone https://github.com/Franck-Dernoncourt/pubmed-rct
!ls pubmed-rct

Cloning into 'pubmed-rct'...
remote: Enumerating objects: 39, done.
remote: Counting objects: 100% (14/14), done.
remote: Compressing objects: 100% (9/9), done.
remote: Total 39 (delta 8), reused 5 (delta 5), pack-reused 25 (from 1)
Receiving objects: 100% (39/39), 177.08 MiB | 14.86 MiB/s, done.
Resolving deltas: 100% (15/15), done.
PubMed_200k_RCT				       PubMed_20k_RCT_numbers_replaced_with_at_sign
PubMed_200k_RCT_numbers_replaced_with_at_sign  README.md
PubMed_20k_RCT


In [3]:
# Check what files are in the PubMed_20K dataset
!ls pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/

dev.txt  test.txt  train.txt


In [4]:
# Start our experiments using the 20k dataset with numbers replaced by "@" sign
data_dir = "/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/"

In [5]:
# Check all of the filenames in the target directory
import os
filenames = [data_dir + filename for filename in os.listdir(data_dir)]
filenames

['/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/test.txt',
 '/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/train.txt',
 '/content/pubmed-rct/PubMed_20k_RCT_numbers_replaced_with_at_sign/dev.txt']

# Preprocess data
Now that we've got some text data, it's time to become one with it.

And one of the best ways to become one with the data is to...

      Visualize, visualize, visualize

So with that in mind, let's write a function to read in all of the lines of a target text file.

In [6]:
# Create function to read the lines of a document
def get_line(filename):
  """
  Reads filename (a text filename) and returns the lines of text as a list.

  Args:
    filename: a string containing the target filepath.

  Returns:
    A list of strings with one string per line from the target filename.
  """
  with open(filename, "r") as f:
    return f.readlines()

In [7]:
# Let's read in the training lines
train_lines = get_line(data_dir + "train.txt")
train_lines[:27]

['###24293578\n',
 'OBJECTIVE\tTo investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( OA ) .\n',
 'METHODS\tA total of @ patients with primary knee OA were randomized @:@ ; @ received @ mg/day of prednisolone and @ received placebo for @ weeks .\n',
 'METHODS\tOutcome measures included pain reduction and improvement in function scores and systemic inflammation markers .\n',
 'METHODS\tPain was assessed using the visual analog pain scale ( @-@ mm ) .\n',
 'METHODS\tSecondary outcome measures included the Western Ontario and McMaster Universities Osteoarthritis Index scores , patient global assessment ( PGA ) of the severity of knee OA , and @-min walk distance ( @MWD ) .\n',
 'METHODS\tSerum levels of interleukin @ ( IL-@ ) , IL-@ , tumor necrosis factor ( TNF ) - , and 

In [8]:
len(train_lines)

210040

In [9]:
def preprocess_text_with_line_numbers(filename):
  """
  Returns a list of dictionaries of abstract line data.

  Takes in filename, reads its contents and sorts through each line,
  extracting things like the target label, the text of the sentence,
  how many sentences are in the current abstract and what sentence
  number the target line is.
  """
  input_lines = get_line(filename) # get all lines from filename
  abstract_lines = "" # create an empty abstract
  abstract_samples = [] # create an empty list of abstracts

  # Loop through each line in the target file
  for line in input_lines:
    if line.startswith("###"): # check to see if the line is an ID line
      abstract_id = line
      abstract_lines = "" # reset the abstract string if the line is an ID line

    elif line.isspace(): # check to see if line is a new line
      abstract_line_split = abstract_lines.splitlines() # split abstract into separate lines

      # Iterate through each line in a single abstract and count them at the same time
      for abstract_line_number, abstract_line in enumerate(abstract_line_split):
        line_data = {} # create an empty dictionary for each line
        target_text_split = abstract_line.split("\t") # split target label from text
        line_data["target"] = target_text_split[0] # get target label
        line_data["text"] = target_text_split[1].lower() # get target text and lower it
        line_data["line_number"] = abstract_line_number # what number line does the line appear in the abstract?
        line_data["total_lines"] = len(abstract_line_split) - 1 # how many total lines are there in the target abstract? (start from 0)
        abstract_samples.append(line_data) # add line data to abstract samples list

    else: # if the above conditions aren't fulfilled, the line contains a labelled sentence
      abstract_lines += line
  return abstract_samples
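
To see what this kind of parsing produces without touching the real files, here is a minimal sketch that applies the same idea to a tiny in-memory sample. The helper name `parse_abstracts` and the two-sentence sample are made up for illustration; only the file layout (a `###` ID line, `LABEL<TAB>sentence` lines, and a blank line closing each abstract) comes from the data shown above.

```python
def parse_abstracts(raw_text):
    """Hypothetical helper: parse PubMed RCT style text into per-sentence dicts."""
    samples = []
    current = []  # labelled lines of the abstract currently being read
    for line in raw_text.splitlines():
        if line.startswith("###"):   # ID line: start a new abstract
            current = []
        elif not line.strip():       # blank line: the abstract is complete
            for number, labelled in enumerate(current):
                label, text = labelled.split("\t", 1)
                samples.append({
                    "target": label,
                    "text": text.lower(),
                    "line_number": number,
                    "total_lines": len(current) - 1,  # counted from 0
                })
            current = []
        else:                        # labelled sentence
            current.append(line)
    return samples


# Made-up sample in the same layout as train.txt
sample = "###123\nOBJECTIVE\tTo test X .\nMETHODS\tWe did Y .\n\n"
parsed = parse_abstracts(sample)
for row in parsed:
    print(row)
```

Note that an abstract is only emitted when its closing blank line is reached, so a file that ends without a trailing blank line would silently drop its last abstract.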


In [10]:
%%time
# Get data from file and preprocess it
train_samples = preprocess_text_with_line_numbers(data_dir + "train.txt")
val_samples = preprocess_text_with_line_numbers(data_dir + "dev.txt")
test_samples = preprocess_text_with_line_numbers(data_dir + "test.txt")
print(len(train_samples), len(val_samples), len(test_samples))

180040 30212 30135
CPU times: user 445 ms, sys: 105 ms, total: 549 ms
Wall time: 548 ms


In [11]:
# Check the first abstract of our training data
train_samples[:14]

[{'target': 'OBJECTIVE',
  'text': 'to investigate the efficacy of @ weeks of daily low-dose oral prednisolone in improving pain , mobility , and systemic low-grade inflammation in the short term and whether the effect would be sustained at @ weeks in older adults with moderate to severe knee osteoarthritis ( oa ) .',
  'line_number': 0,
  'total_lines': 11},
 {'target': 'METHODS',
  'text': 'a total of @ patients with primary knee oa were randomized @:@ ; @ received @ mg/day of prednisolone and @ received placebo for @ weeks .',
  'line_number': 1,
  'total_lines': 11},
 {'target': 'METHODS',
  'text': 'outcome measures included pain reduction and improvement in function scores and systemic inflammation markers .',
  'line_number': 2,
  'total_lines': 11},
 {'target': 'METHODS',
  'text': 'pain was assessed using the visual analog pain scale ( @-@ mm ) .',
  'line_number': 3,
  'total_lines': 11},
 {'target': 'METHODS',
  'text': 'secondary outcome measures included the western ontari

In [12]:
import pandas as pd
train_df = pd.DataFrame(train_samples)
val_df = pd.DataFrame(val_samples)
test_df = pd.DataFrame(test_samples)
train_df.head(14)

   target     text                                               line_number  total_lines
0  OBJECTIVE  to investigate the efficacy of @ weeks of dail...  0            11
1  METHODS    a total of @ patients with primary knee oa wer...  1            11
2  METHODS    outcome measures included pain reduction and i...  2            11
3  METHODS    pain was assessed using the visual analog pain...  3            11
4  METHODS    secondary outcome measures included the wester...  4            11
5  METHODS    serum levels of interleukin @ ( il-@ ) , il-@ ...  5            11
6  RESULTS    there was a clinically relevant reduction in t...  6            11
7  RESULTS    the mean difference between treatment arms ( @...  7            11
8  RESULTS    further , there was a clinically relevant redu...  8            11
9  RESULTS    these differences remained significant at @ we...  9            11


In [13]:
# Distribution of labels in training data
train_df.target.value_counts()

target       count
METHODS      59353
RESULTS      57953
CONCLUSIONS  27168
BACKGROUND   21727
OBJECTIVE    13839
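
The labels are clearly imbalanced, which matters when interpreting accuracy later. The sketch below turns the training counts into shares and works out what accuracy a model would get by always predicting the most frequent label. The counts are copied from the output above; the "majority-class baseline" framing is an addition for illustration, not something the notebook computes.

```python
# Label counts copied from the value_counts() output
label_counts = {
    "METHODS": 59353,
    "RESULTS": 57953,
    "CONCLUSIONS": 27168,
    "BACKGROUND": 21727,
    "OBJECTIVE": 13839,
}

total = sum(label_counts.values())

# Share of each label, plus the accuracy of always guessing the largest class
label_share = {label: count / total for label, count in label_counts.items()}
majority_baseline = max(label_share.values())

for label, share in label_share.items():
    print(f"{label:<12}{share:.3f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```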


In [14]:
from sklearn.preprocessing import MinMaxScaler

# Rebuild the training DataFrame from the preprocessed samples
train_df = pd.DataFrame(train_samples)

# Data cleaning: remove special characters (e.g., '@') from the training text
# Note: val_df and test_df are left unchanged here, so their text still contains these characters
train_df["text"] = train_df["text"].str.replace(r"[@#%^&*()]", "", regex=True)

# Feature engineering: min-max normalize line_number and total_lines
# (fit_transform refits the scaler on every call, so one scaler object can be reused)
scaler = MinMaxScaler()
train_df["line_number_normalized"] = scaler.fit_transform(train_df[["line_number"]])
train_df["total_lines_normalized"] = scaler.fit_transform(train_df[["total_lines"]])

# Create relative position feature (0 = first sentence, 1 = last sentence)
train_df["relative_position"] = train_df["line_number"] / train_df["total_lines"]

# Check the updated dataframe
print(train_df.head())


      target                                               text  line_number  \
0  OBJECTIVE  to investigate the efficacy of  weeks of daily...            0   
1    METHODS  a total of  patients with primary knee oa were...            1   
2    METHODS  outcome measures included pain reduction and i...            2   
3    METHODS  pain was assessed using the visual analog pain...            3   
4    METHODS  secondary outcome measures included the wester...            4   

   total_lines  line_number_normalized  total_lines_normalized  \
0           11                0.000000                0.296296   
1           11                0.033333                0.296296   
2           11                0.066667                0.296296   
3           11                0.100000                0.296296   
4           11                0.133333                0.296296   

   relative_position  
0           0.000000  
1           0.090909  
2           0.181818  
3           0.272727  
4      
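
One edge case in the relative position feature is worth spelling out: because `total_lines` is counted from 0, an abstract with a single sentence would have `total_lines == 0`, and pandas evaluates `0 / 0` as `NaN`. Whether such abstracts occur in this dataset is not checked here. The sketch below uses a hypothetical helper, `relative_position`, to show one way to guard against it; returning `0.0` in that case is an assumption, not the notebook's behaviour.

```python
def relative_position(line_number, total_lines):
    """Hypothetical helper: position of a sentence within its abstract, in [0, 1]."""
    if total_lines == 0:  # single-sentence abstract: avoid 0 / 0
        return 0.0
    return line_number / total_lines


print(relative_position(1, 11))   # second sentence of a 12-sentence abstract
print(relative_position(11, 11))  # last sentence
print(relative_position(0, 0))    # guarded edge case
```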

In [16]:
from sklearn.preprocessing import LabelEncoder

# Ensure label encoding
if "label_encoded" not in train_df.columns:
    label_encoder = LabelEncoder()
    train_df["label_encoded"] = label_encoder.fit_transform(train_df["target"])
    val_df["label_encoded"] = label_encoder.transform(val_df["target"])  # Apply the same encoder to validation data


In [17]:
# Verify columns
print(train_df.columns)
print(val_df.columns)


Index(['target', 'text', 'line_number', 'total_lines',
       'line_number_normalized', 'total_lines_normalized', 'relative_position',
       'label_encoded'],
      dtype='object')
Index(['target', 'text', 'line_number', 'total_lines', 'label_encoded'], dtype='object')
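
To make the `label_encoded` column less of a black box: scikit-learn's `LabelEncoder` sorts the unique class names and numbers them from 0. The sketch below imitates that in plain Python on a made-up list of labels; it is an illustration of the mapping, not a replacement for the encoder.

```python
# Made-up sequence of labels, using the five classes in this dataset
labels = ["OBJECTIVE", "METHODS", "METHODS", "RESULTS", "CONCLUSIONS", "BACKGROUND"]

# Sort the unique classes and number them from 0, as LabelEncoder does
classes = sorted(set(labels))
class_to_index = {name: index for index, name in enumerate(classes)}
encoded = [class_to_index[name] for name in labels]

print(class_to_index)
print(encoded)
```

Because the mapping depends only on the sorted class names, fitting on the training labels and reusing the same encoder for validation data keeps the integer codes consistent across splits.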


In [22]:
from sklearn.preprocessing import MinMaxScaler

# Initialize the scaler
scaler = MinMaxScaler()

# Fit on training data (line_number and total_lines)
scaler.fit(train_df[["line_number", "total_lines"]])

# Normalize 'line_number' and 'total_lines' for train_df
train_df[["line_number_normalized", "total_lines_normalized"]] = scaler.transform(train_df[["line_number", "total_lines"]])
train_df["relative_position"] = train_df["line_number"] / train_df["total_lines"]

# Normalize 'line_number' and 'total_lines' for val_df (or test_df)
val_df[["line_number_normalized", "total_lines_normalized"]] = scaler.transform(val_df[["line_number", "total_lines"]])
val_df["relative_position"] = val_df["line_number"] / val_df["total_lines"]
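
The cell above fits the scaler on the training data only and then reuses it for the validation data. Here is a minimal plain-Python sketch of that idea. The helper names are hypothetical, and the ranges (line numbers 0 to 30, total lines 3 to 30) are an assumption inferred from the normalized values printed earlier, not read from the data.

```python
def fit_min_max(values):
    """Return the (min, max) pair learned from training values."""
    return min(values), max(values)


def apply_min_max(value, stats):
    """Scale one value using statistics learned earlier."""
    low, high = stats
    return (value - low) / (high - low)


# Assumed training ranges, inferred from the printed normalized values
line_stats = fit_min_max(range(0, 31))   # line_number: 0..30
total_stats = fit_min_max(range(3, 31))  # total_lines: 3..30

print(round(apply_min_max(1, line_stats), 6))    # matches 0.033333 in the output
print(round(apply_min_max(11, total_stats), 6))  # matches 0.296296 in the output

# A validation value beyond the training maximum lands outside [0, 1]
print(round(apply_min_max(33, line_stats), 6))
```

The last line shows the trade-off of fitting on training data only: unseen extremes are not clipped by default, so the model may see inputs slightly outside the range it was trained on.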


In [24]:
import tensorflow as tf
from transformers import BertTokenizer

# Initialize the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Function to preprocess data
def preprocess_data(df, tokenizer):
    # Tokenize the text
    encoding = tokenizer(
        list(df["text"].values),
        max_length=128,  # Sequence length used here (BERT itself supports up to 512 tokens)
        padding="max_length",  # Pad sequences to the max length
        truncation=True,  # Truncate sequences longer than max length
        return_tensors="tf"
    )
    # Combine tokenized text with additional features
    features = tf.data.Dataset.from_tensor_slices((
        {
            "input_ids": encoding["input_ids"],
            "attention_mask": encoding["attention_mask"],
            "line_number_normalized": df["line_number_normalized"].values,
            "total_lines_normalized": df["total_lines_normalized"].values,
            "relative_position": df["relative_position"].values
        },
        df["label_encoded"].values  # Target labels
    ))
    return features

# Prepare datasets for training and evaluation
train_dataset = preprocess_data(train_df, tokenizer)
test_dataset = preprocess_data(val_df, tokenizer)  # Built from val_df (dev.txt), so this is really a validation set

# Shuffle and batch the training dataset
train_dataset = train_dataset.shuffle(1000).batch(16).prefetch(tf.data.AUTOTUNE)

# Batch the test dataset
test_dataset = test_dataset.batch(16).prefetch(tf.data.AUTOTUNE)
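
To illustrate what `padding="max_length"` and `truncation=True` do to each sentence, here is a small sketch with a hypothetical helper, `pad_or_truncate`. The special token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are the ones used by `bert-base-uncased`; the other ids are made up, and the real tokenizer does more than this (for example, WordPiece splitting).

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special token ids in bert-base-uncased


def pad_or_truncate(token_ids, max_length):
    """Hypothetical helper mimicking padding="max_length" with truncation=True."""
    body = token_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 = real token
    padding = max_length - len(input_ids)
    return input_ids + [PAD_ID] * padding, attention_mask + [0] * padding


# A short sentence is padded, a long one is cut to fit
short_ids, short_mask = pad_or_truncate([2000, 2001, 2002], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(2000, 2020)), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed alongside `input_ids`.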


In [25]:
# Print a sample from the train_dataset
for features, label in train_dataset.take(1):
    print("Input IDs:", features["input_ids"])
    print("Attention Mask:", features["attention_mask"])
    print("Line Number Normalized:", features["line_number_normalized"])
    print("Total Lines Normalized:", features["total_lines_normalized"])
    print("Relative Position:", features["relative_position"])
    print("Label:", label)



Input IDs: tf.Tensor(
[[  101  9560  5761 ...     0     0     0]
 [  101  2695 25918 ...     0     0     0]
 [  101 15488  2102 ...     0     0     0]
 ...
 [  101 10462  1010 ...     0     0     0]
 [  101  1996 20647 ...     0     0     0]
 [  101  1037  4997 ...     0     0     0]], shape=(16, 128), dtype=int32)
Attention Mask: tf.Tensor(
[[1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 ...
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]], shape=(16, 128), dtype=int32)
Line Number Normalized: tf.Tensor(
[0.06666667 0.13333333 0.03333333 0.43333333 0.4        0.26666667
 0.1        0.33333333 0.16666667 0.46666667 0.16666667 0.33333333
 0.23333333 0.16666667 0.06666667 0.4       ], shape=(16,), dtype=float64)
Total Lines Normalized: tf.Tensor(
[0.2962963  0.33333333 0.59259259 0.59259259 0.33333333 0.2962963
 0.40740741 0.59259259 0.37037037 0.44444444 0.25925926 0.2962963
 0.48148148 0.48148148 0.18518519 0.33333333], shape=(16,), dtype=float64)
Relative Position: 

In [26]:
from transformers import TFBertModel
import tensorflow as tf

# Load pretrained BERT model
bert_model = TFBertModel.from_pretrained("bert-base-uncased")

# Define inputs
input_ids = tf.keras.layers.Input(shape=(128,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.layers.Input(shape=(128,), dtype=tf.int32, name="attention_mask")
line_number_normalized = tf.keras.layers.Input(shape=(1,), dtype=tf.float32, name="line_number_normalized")
total_lines_normalized = tf.keras.layers.Input(shape=(1,), dtype=tf.float32, name="total_lines_normalized")
relative_position = tf.keras.layers.Input(shape=(1,), dtype=tf.float32, name="relative_position")

# BERT outputs
bert_outputs = bert_model(input_ids, attention_mask=attention_mask)
bert_pooled_output = bert_outputs.pooler_output  # [CLS] hidden state passed through BERT's pooling layer (dense + tanh)

# Combine BERT output with positional features
combined_features = tf.keras.layers.Concatenate()(
    [bert_pooled_output, line_number_normalized, total_lines_normalized, relative_position]
)

# Fully connected layers for classification
dense = tf.keras.layers.Dense(128, activation="relu")(combined_features)
output = tf.keras.layers.Dense(len(train_df["label_encoded"].unique()), activation="softmax")(dense)

# Build the model
model = tf.keras.Model(
    inputs=[input_ids, attention_mask, line_number_normalized, total_lines_normalized, relative_position],
    outputs=output
)

# Compile the model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)

model.summary()


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_ids (InputLayer)      [(None, 128)]                0         []                            
                                                                                                  
 attention_mask (InputLayer  [(None, 128)]                0         []                            
 )                                                                                                
                                                                                                  
 tf_bert_model (TFBertModel  TFBaseModelOutputWithPooli   1094822   ['input_ids[0][0]',           
 )                           ngAndCrossAttentions(last_   40         'attention_mask[0][0]']      
                             hidden_state=(None, 128, 7                                       
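
Since the summary is cut off before the classification head, the arithmetic below works out how many parameters the two `Dense` layers should add on top of BERT. It assumes the pooled output of `bert-base-uncased` has width 768 and that there are 5 classes, as in the label counts earlier.

```python
bert_hidden_size = 768     # pooled output width of bert-base-uncased
num_position_features = 3  # line_number, total_lines, relative_position
dense_units = 128
num_classes = 5

# Width of the vector produced by the Concatenate layer
concat_width = bert_hidden_size + num_position_features

# A Dense layer has (inputs * units) weights plus one bias per unit
dense_params = concat_width * dense_units + dense_units
output_params = dense_units * num_classes + num_classes

print(concat_width, dense_params, output_params)
```

The head is tiny compared with BERT itself, so almost all of the training cost comes from fine-tuning the BERT weights.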

In [27]:
# Train the model
history = model.fit(
    train_dataset,
    validation_data=test_dataset,
    epochs=3,  # Adjust based on your needs
    # No batch_size here: train_dataset is already batched with .batch(16)
)


Epoch 1/3
Epoch 2/3
Epoch 3/3
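
As a rough guide to how long each epoch is, the number of steps follows directly from the dataset sizes printed earlier and the batch size of 16. This is plain arithmetic, not output captured from the training run.

```python
import math

train_size, val_size = 180040, 30212  # sample counts printed earlier
batch_size = 16

# The last batch may be partial, hence the ceiling
train_steps = math.ceil(train_size / batch_size)
val_steps = math.ceil(val_size / batch_size)

print(train_steps, val_steps)
```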


In [28]:
# Evaluate on the test dataset
test_loss, test_accuracy = model.evaluate(test_dataset)
print(f"Test Loss: {test_loss}, Test Accuracy: {test_accuracy}")


Test Loss: 0.31859487295150757, Test Accuracy: 0.8845822811126709


In [32]:
import numpy as np

# Get a batch of data from the test dataset
for features, labels in test_dataset.take(1):  # Take one batch from the test dataset
    # Predict the outputs for this batch
    predictions = model.predict(features)
    predicted_labels = np.argmax(predictions, axis=1)  # Get the predicted label for each sample

    # Convert input IDs back to text (sentences)
    input_ids = features["input_ids"].numpy()
    sentences = [tokenizer.decode(input_id, skip_special_tokens=True) for input_id in input_ids]

    # Get actual labels from the batch
    actual_labels = labels.numpy()

    # Map encoded labels to class names
    label_mapping = {index: label for index, label in enumerate(label_encoder.classes_)}
    actual_classes = [label_mapping[label] for label in actual_labels]
    predicted_classes = [label_mapping[label] for label in predicted_labels]

    # Display the sentences along with their predicted and actual labels
    print("Sample Predictions:\n")
    for sentence, actual, predicted in zip(sentences, actual_classes, predicted_classes):
        print(f"Sentence: {sentence}")
        print(f"Actual: {actual}, Predicted: {predicted}")
        print("-" * 50)


Sample Predictions:

Sentence: ige sensitization to aspergillus fumigatus and a positive sputum fungal culture result are common in patients with refractory asthma.
Actual: BACKGROUND, Predicted: BACKGROUND
--------------------------------------------------
Sentence: it is not clear whether these patients would benefit from antifungal treatment.
Actual: BACKGROUND, Predicted: BACKGROUND
--------------------------------------------------
Sentence: we sought to determine whether a @ - month course of voriconazole improved asthma - related outcomes in patients with asthma who are ige sensitized to a fumigatus.
Actual: OBJECTIVE, Predicted: OBJECTIVE
--------------------------------------------------
Sentence: asthmatic patients who were ige sensitized to a fumigatus with a history of at least @ severe exacerbations in the previous @ months were treated for @ months with @ mg of voriconazole twice daily, followed by observation for @ months, in a double - blind, placebo - controlled, rando
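
The step from softmax outputs to class names in the cell above can be sketched without TensorFlow. The probabilities below are made up for illustration; the class order is the sorted order of the five labels.

```python
classes = ["BACKGROUND", "CONCLUSIONS", "METHODS", "OBJECTIVE", "RESULTS"]

# Made-up softmax outputs for two sentences (each row sums to 1)
probabilities = [
    [0.80, 0.05, 0.05, 0.05, 0.05],
    [0.10, 0.05, 0.15, 0.60, 0.10],
]


def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


predicted = [classes[argmax(row)] for row in probabilities]
print(predicted)
```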

In [33]:
# Save the trained model to a file
model.save("skimlit_model.h5")

print("Model saved as 'skimlit_model.h5'")


  saving_api.save_model(


Model saved as 'skimlit_model.h5'
