# Assignment 06 - CNN for Binary Drug Detection
## BMIG60030 - Fall 2025

This notebook implements a Convolutional Neural Network for binary classification of drug mentions in clinical text using the MIMIC-Ext-DrugDetection dataset.

**Binary Classification Task**: Detect whether a sentence contains any drug mention (`has_drug`: stored as True/False, converted to 1.0/0.0 for training)

**Dataset**: MIMIC-Ext-DrugDetection, using `train.csv` (804 samples) and `val.csv` (806 samples), both of which include labels

**Optimized for**: Google Cloud A100 GPU

**Note**: test.csv is not used as it doesn't contain output labels

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## GPU Setup and Configuration

In [2]:
# Check GPU availability and configure TensorFlow
import tensorflow as tf
import os

# Enable memory growth to avoid OOM errors
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        print(f"Found {len(gpus)} GPU(s):")
        for i, gpu in enumerate(gpus):
            print(f"  GPU {i}: {gpu}")

        # Enable mixed precision for A100 GPU (faster training)
        from tensorflow.keras import mixed_precision
        policy = mixed_precision.Policy('mixed_float16')
        mixed_precision.set_global_policy(policy)
        print(f"\nMixed precision enabled: {policy.name}")
        print("Compute dtype:", policy.compute_dtype)
        print("Variable dtype:", policy.variable_dtype)
    except RuntimeError as e:
        print(e)
else:
    print("No GPU found. Running on CPU.")

# Set XLA compilation for additional speedup
tf.config.optimizer.set_jit(True)
print("\nXLA (Accelerated Linear Algebra) enabled for faster computation")

# Check TensorFlow version and build info
print(f"\nTensorFlow version: {tf.__version__}")
print(f"Built with CUDA: {tf.test.is_built_with_cuda()}")

Found 1 GPU(s):
  GPU 0: PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')

Mixed precision enabled: mixed_float16
Compute dtype: float16
Variable dtype: float32

XLA (Accelerated Linear Algebra) enabled for faster computation

TensorFlow version: 2.19.0
Built with CUDA: True
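
The policy reported above computes in float16 while keeping variables in float32. As a rough illustration of why the final sigmoid layer is usually kept in float32, the sketch below rounds values to half precision using only the standard library. The helper name `to_float16` is an assumption made for this example and is not part of TensorFlow; Keras also applies its own numerical safeguards, so this shows the motivation only.

```python
import math
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# A confident sigmoid output just below 1.0 is not representable in
# float16 and rounds up to exactly 1.0.
p = 0.9999
p_half = to_float16(p)
print("float16(0.9999) =", p_half)

# In full precision the cross-entropy term log(1 - p) is finite.
print("log(1 - p) in full precision =", math.log(1.0 - p))

# After rounding to float16 the argument of the log becomes zero,
# so the same term would no longer be finite.
print("1 - float16(p) is zero:", 1.0 - p_half == 0.0)

# Very small positive values underflow to zero in float16.
print("float16(1e-8) =", to_float16(1e-8))
```

Values such as 0.5 that are exactly representable survive the round trip unchanged, so the effect only appears near the edges of the float16 range.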


## Setup and Imports

In [3]:
# Install required packages
!pip install gensim
!pip install nltk
!pip install pandas
!pip install scikit-learn

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 98.0 MB/s eta 0:00:00
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [4]:
import pandas as pd
import numpy as np
from random import shuffle
import time
from multiprocessing import cpu_count

# Keras/TensorFlow imports
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint

# NLP imports
from nltk.tokenize import TreebankWordTokenizer
from gensim.models import KeyedVectors, Word2Vec
import nltk

# Download NLTK data (note: TreebankWordTokenizer itself is rule-based
# and does not require punkt)
nltk.download('punkt', quiet=True)

print("All packages imported successfully!")
print(f"Available CPU cores: {cpu_count()}")

All packages imported successfully!
Available CPU cores: 12


## Load MIMIC-Ext-DrugDetection Dataset

**Note**: We only use train.csv and val.csv as test.csv doesn't contain labels

In [31]:
# Load the datasets (only train and val have labels)
data_path = '/content/drive/MyDrive/Summer Lab Rotation/Data/MIMIC-Ext-DrugDetection'


start_time = time.time()

train_df = pd.read_csv(os.path.join(data_path, 'train.csv'))
val_df = pd.read_csv(os.path.join(data_path, 'val.csv'))
# test.csv is NOT loaded as it doesn't contain labels

print(f"Loaded in {time.time() - start_time:.2f} seconds")
print(f"\nTrain set size: {len(train_df)}")
print(f"Validation set size: {len(val_df)}")
print(f"\nClass distribution in training set:")
print(train_df['has_drug'].value_counts())
print(f"\nClass distribution in validation set:")
print(val_df['has_drug'].value_counts())
print(f"\nSample sentence from training set:")
print(train_df.iloc[0]['text'][:200])

Loaded in 0.02 seconds

Train set size: 804
Validation set size: 806

Class distribution in training set:
has_drug
True     402
False    402
Name: count, dtype: int64

Class distribution in validation set:
has_drug
False    403
True     403
Name: count, dtype: int64

Sample sentence from training set:
Service: MEDICINE Allergies: shellfish derived Attending: ___ Chief Complaint: Nausea and vomiting, abdominal pain Major Surgical or Invasive Procedure: N/A History of Present Illness: Mr. ___ is a __


In [6]:
# Convert to format expected by our functions: list of (label, text) tuples
def df_to_dataset(df):
    return [(row['has_drug'], str(row['text'])) for _, row in df.iterrows()]

train_dataset = df_to_dataset(train_df)
val_dataset = df_to_dataset(val_df)

print(f"Converted {len(train_dataset)} training samples")
print(f"Converted {len(val_dataset)} validation samples")
print(f"\nSample data point (label, text): ({train_dataset[0][0]}, '{train_dataset[0][1][:100]}...')")

Converted 804 training samples
Converted 806 validation samples

Sample data point (label, text): (True, 'Service: MEDICINE Allergies: shellfish derived Attending: ___ Chief Complaint: Nausea and vomiting, ...')


## Analyze Sentence Lengths


In [33]:
# Analyze sentence lengths across train and validation datasets

tokenizer = TreebankWordTokenizer()
all_lengths = [len(tokenizer.tokenize(text)) for _, text in train_dataset + val_dataset]


print(f"Total sentences: {len(all_lengths)}")
print(f"Min length: {min(all_lengths)}")
print(f"Max length: {max(all_lengths)}")
print(f"Mean length: {np.mean(all_lengths):.2f}")
print(f"Median length: {np.median(all_lengths):.2f}")
print(f"75th percentile: {np.percentile(all_lengths, 75):.2f}")
print(f"90th percentile: {np.percentile(all_lengths, 90):.2f}")
print(f"95th percentile: {np.percentile(all_lengths, 95):.2f}")
print(f"99th percentile: {np.percentile(all_lengths, 99):.2f}")

print(f"\nSentences <= 30 tokens: {sum(1 for l in all_lengths if l <= 30)/len(all_lengths)*100:.1f}%")
print(f"Sentences <= 50 tokens: {sum(1 for l in all_lengths if l <= 50)/len(all_lengths)*100:.1f}%")
print(f"Sentences <= 75 tokens: {sum(1 for l in all_lengths if l <= 75)/len(all_lengths)*100:.1f}%")

Total sentences: 1610
Min length: 1
Max length: 178
Mean length: 19.79
Median length: 15.00
75th percentile: 25.00
90th percentile: 40.00
95th percentile: 51.55
99th percentile: 88.91

Sentences <= 30 tokens: 81.7%
Sentences <= 50 tokens: 94.7%
Sentences <= 75 tokens: 98.1%
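
The coverage percentages above can be generalized into a small helper for picking candidate `maxlen` values. This is a minimal sketch; the function names `coverage` and `smallest_maxlen` and the toy token counts are assumptions for illustration, not part of the notebook.

```python
def coverage(lengths, maxlen):
    """Fraction of sequences whose token count is at most maxlen."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for n in lengths if n <= maxlen) / len(lengths)


def smallest_maxlen(lengths, target):
    """Smallest observed length whose coverage reaches the target fraction."""
    for candidate in sorted(set(lengths)):
        if coverage(lengths, candidate) >= target:
            return candidate
    return max(lengths)


# Toy token counts (hypothetical, not taken from the dataset).
toy_lengths = [3, 8, 12, 15, 15, 20, 25, 31, 40, 90]

for maxlen in (30, 50, 75):
    print(f"maxlen={maxlen}: {coverage(toy_lengths, maxlen):.0%} untruncated")

print("smallest maxlen covering 90%:", smallest_maxlen(toy_lengths, 0.90))
```

In the notebook, `all_lengths` could be passed in place of `toy_lengths`.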


## Build Custom Word2Vec from MIMIC Dataset


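**CPU Parallelism**: gensim trains Word2Vec on the CPU, so we use all available cores (`workers=cpu_count()`). Note that with more than one worker thread, results are not exactly reproducible even with a fixed `seed`.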

In [34]:
# Combine train and val datasets for training Word2Vec
all_texts = [text for _, text in train_dataset + val_dataset]

# Tokenize all texts
print("Tokenizing texts...")
start_time = time.time()
tokenized_texts = [tokenizer.tokenize(text.lower()) for text in all_texts]
print(f"Tokenization completed in {time.time() - start_time:.2f} seconds")

print(f"\nTraining custom Word2Vec on {len(tokenized_texts)} sentences...")
print(f"Total tokens: {sum(len(sent) for sent in tokenized_texts):,}")

start_time = time.time()
mimic_w2v_model = Word2Vec(
    sentences=tokenized_texts,
    vector_size=300,
    window=5,
    min_count=2,
    sg=1,
    workers=cpu_count(),  # Use all CPU cores
    epochs=10,
    negative=10,
    seed=42
)

training_time = time.time() - start_time
mimic_word_vectors = mimic_w2v_model.wv

print(f"\nWord2Vec training completed in {training_time:.2f} seconds")
print(f"MIMIC Word2Vec vocabulary size: {len(mimic_word_vectors):,}")
print(f"\nSample medical/drug terms in vocabulary:")
sample_words = [w for w in mimic_word_vectors.index_to_key[:100] if len(w) > 3]
print(sample_words[:20])

embedding_dims = 300

Tokenizing texts...
Tokenization completed in 0.08 seconds

Training custom Word2Vec on 1610 sentences...
Total tokens: 31,859

Word2Vec training completed in 1.46 seconds
MIMIC Word2Vec vocabulary size: 2,110

Sample medical/drug terms in vocabulary:
['with', 'history', 'abuse', 'patient', 'heroin', 'drug', 'cocaine', 'past', 'pain', 'blood', 'ivdu', 'that', 'from', 'last', 'polysubstance', 'disorder', 'medical', 'substance', 'this', 'alcohol']
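
The `min_count=2` setting used above drops every token that occurs only once in the corpus, which is one reason the vocabulary is much smaller than the token count. A minimal sketch of that pruning rule, independent of gensim (the function name `build_vocab` and the toy corpus are assumptions for illustration):

```python
from collections import Counter


def build_vocab(tokenized_texts, min_count=2):
    """Keep tokens whose corpus frequency is at least min_count."""
    counts = Counter(tok for sent in tokenized_texts for tok in sent)
    kept = {tok for tok, n in counts.items() if n >= min_count}
    dropped = {tok for tok, n in counts.items() if n < min_count}
    return kept, dropped


# Hypothetical toy corpus, lowercased like the notebook's input.
corpus = [
    ["patient", "denies", "heroin", "use"],
    ["patient", "reports", "cocaine", "use"],
    ["history", "of", "heroin", "abuse"],
]

kept, dropped = build_vocab(corpus, min_count=2)
print("kept:   ", sorted(kept))
print("dropped:", sorted(dropped))
```

On a corpus this small, rare drug names may fall below the threshold and end up without a vector, so it can be worth inspecting the dropped set.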


## Helper Functions (GPU-Optimized)

In [9]:
def tokenize_and_vectorize(dataset, word_vectors):
    """
    Tokenize text and convert tokens to word vectors.
    Optimized for speed with list comprehensions.
    """
    tokenizer = TreebankWordTokenizer()
    vectorized_data = []

    for sample in dataset:
        tokens = tokenizer.tokenize(sample[1].lower())
        # Use list comprehension for speed
        sample_vecs = [word_vectors[token] for token in tokens if token in word_vectors]
        vectorized_data.append(sample_vecs)

    return vectorized_data


def collect_expected(dataset):
    """Extract labels from dataset."""
    return [sample[0] for sample in dataset]


def pad_trunc(data, maxlen, embedding_dims):
    """
    Pad with zero vectors or truncate so every sample has exactly maxlen vectors.
    Implemented as a plain Python loop over samples.
    """
    new_data = []
    zero_vector = np.zeros(embedding_dims, dtype=np.float32)  # Use numpy for speed

    for sample in data:
        if len(sample) > maxlen:
            temp = sample[:maxlen]
        elif len(sample) < maxlen:
            # Pad with zero vectors
            temp = sample + [zero_vector] * (maxlen - len(sample))
        else:
            temp = sample
        new_data.append(temp)

    return new_data


def prepare_data(train_dataset, val_dataset, word_vectors, maxlen, embedding_dims):
    """
    Complete data preparation pipeline.
    Returns data in float32 for GPU efficiency.

    Args:
        train_dataset: Training data (list of (label, text) tuples)
        val_dataset: Validation data (used for both validation and testing)
        word_vectors: Word embeddings (Word2Vec or GloVe)
        maxlen: Maximum sequence length
        embedding_dims: Dimension of word embeddings

    Returns:
        x_train, y_train, x_val, y_val: Prepared arrays
    """
    print("Preparing data...")
    start_time = time.time()

    # Vectorize
    x_train = tokenize_and_vectorize(train_dataset, word_vectors)
    x_val = tokenize_and_vectorize(val_dataset, word_vectors)

    # Extract labels
    y_train = collect_expected(train_dataset)
    y_val = collect_expected(val_dataset)

    # Pad/truncate
    x_train = pad_trunc(x_train, maxlen, embedding_dims)
    x_val = pad_trunc(x_val, maxlen, embedding_dims)

    # Reshape to numpy arrays with float32 for GPU efficiency
    x_train = np.array(x_train, dtype=np.float32).reshape(len(x_train), maxlen, embedding_dims)
    y_train = np.array(y_train, dtype=np.float32)
    x_val = np.array(x_val, dtype=np.float32).reshape(len(x_val), maxlen, embedding_dims)
    y_val = np.array(y_val, dtype=np.float32)

    print(f"Data preparation completed in {time.time() - start_time:.2f} seconds")
    print(f"Train shape: {x_train.shape}, Val shape: {x_val.shape}")

    return x_train, y_train, x_val, y_val


def build_cnn_model(maxlen, embedding_dims, filters=250, kernel_size=3, hidden_dims=250):
    """
    Build CNN model for binary classification.
    Optimized for GPU with mixed precision.
    """
    model = Sequential()

    # Convolutional layer with ReLU activation
    model.add(Conv1D(filters,
                     kernel_size,
                     padding='valid',
                     activation='relu',
                     strides=1,
                     input_shape=(maxlen, embedding_dims)))

    # Global max pooling
    model.add(GlobalMaxPooling1D())

    # Dense hidden layer
    model.add(Dense(hidden_dims))
    model.add(Dropout(0.2))
    model.add(Activation('relu'))

    # Output layer - sigmoid for binary classification
    # Use float32 for final layer (required for mixed precision)
    model.add(Dense(1, dtype='float32'))
    model.add(Activation('sigmoid', dtype='float32'))

    # Compile with Adam optimizer and binary crossentropy
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    return model


def create_callbacks(model_name='best_model'):
    """
    Create training callbacks: early stopping and learning-rate reduction.

    Note: model_name is currently unused (no ModelCheckpoint is created).
    """
    callbacks = [
        # Early stopping to prevent overfitting
        EarlyStopping(
            monitor='val_accuracy',
            patience=3,
            restore_best_weights=True,
            verbose=1
        ),
        # Reduce learning rate when plateau
        ReduceLROnPlateau(
            monitor='val_loss',
            factor=0.5,
            patience=2,
            min_lr=1e-7,
            verbose=1
        )
    ]
    return callbacks


print("Helper functions defined successfully!")

Helper functions defined successfully!
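
Since the `model.summary()` output is not shown in this export, the layer sizes of `build_cnn_model` can be checked by hand. The sketch below uses the standard formulas for a `valid`-padded 1D convolution and dense layers; the function names are assumptions for illustration. It also shows that, because of global max pooling, the parameter count does not depend on `maxlen`.

```python
def conv1d_output_length(maxlen, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (maxlen - kernel_size) // stride + 1


def cnn_param_count(embedding_dims=300, filters=250, kernel_size=3, hidden_dims=250):
    """Trainable parameters per layer (weights plus biases)."""
    conv = kernel_size * embedding_dims * filters + filters
    hidden = filters * hidden_dims + hidden_dims
    output = hidden_dims * 1 + 1
    return {
        "conv1d": conv,
        "dense_hidden": hidden,
        "dense_output": output,
        "total": conv + hidden + output,
    }


# The number of convolution positions grows with maxlen ...
for maxlen in (30, 50, 75):
    positions = conv1d_output_length(maxlen, kernel_size=3)
    print(f"maxlen={maxlen}: {positions} convolution positions")

# ... but global max pooling collapses them, so the weights are the same.
print(cnn_param_count())
```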


---
# PART 1: Experiment with Different maxlen Values

The `maxlen` parameter determines:
- Input layer size of the CNN
- How many tokens from long documents are kept (truncation)
- How many zero vectors are added to short documents (padding)
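
To make the truncation and padding effects listed above concrete, here is a minimal sketch that works on tokens instead of vectors. The notebook pads with zero vectors; the `<pad>` token, the function name `pad_or_truncate`, and the example sentence are assumptions for illustration.

```python
def pad_or_truncate(tokens, maxlen, pad_token="<pad>"):
    """Return exactly maxlen items: cut the tail or append padding."""
    if len(tokens) >= maxlen:
        return tokens[:maxlen]
    return tokens + [pad_token] * (maxlen - len(tokens))


# Hypothetical sentence where the drug name appears at the end.
tokens = "patient with long history of back pain was started on oxycodone".split()

short = pad_or_truncate(tokens, 8)    # tail is cut, the drug name is lost
padded = pad_or_truncate(tokens, 14)  # full sentence plus padding

print(short)
print(padded)
print("drug mention kept at maxlen=8: ", "oxycodone" in short)
print("drug mention kept at maxlen=14:", "oxycodone" in padded)
```

A `maxlen` that is too small can remove exactly the token the classifier needs, while a larger one only adds padding.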

Based on our sentence length analysis:
- Mean: ~20 tokens, Median: ~15 tokens  
- 90th percentile: ~40 tokens, 95th percentile: ~52 tokens

We'll test values that make sense for this distribution.

**GPU Optimization**: Using a larger batch size (128) to improve GPU throughput. With only 804 training samples this yields 7 steps per epoch, so the gain on an A100 is likely modest.

In [35]:
# Set batch size optimized for A100 GPU
batch_size = 128

### maxlen = 30
Slightly above the 75th percentile (25 tokens): 81.7% of sentences fit without truncation.

In [38]:
# maxlen = 30 (slightly above the 75th percentile)
maxlen_1 = 30
epochs = 5  # Using 5 epochs initially



start_time = time.time()
x_train, y_train, x_val, y_val = prepare_data(
    train_dataset, val_dataset,
    mimic_word_vectors, maxlen_1, embedding_dims
)

model_maxlen_30 = build_cnn_model(maxlen_1, embedding_dims)
print("\nModel architecture:")
model_maxlen_30.summary()

print("\nTraining model on GPU...")
train_start = time.time()
history_maxlen_30 = model_maxlen_30.fit(
    x_train, y_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=(x_val, y_val),
    callbacks=create_callbacks('maxlen_30'),
    verbose=1
)
train_time = time.time() - train_start

# Evaluate on validation set (used as test set)
val_loss_30, val_acc_30 = model_maxlen_30.evaluate(x_val, y_val, verbose=0)
total_time = time.time() - start_time

print(f"\n*** Validation Accuracy with maxlen=30: {val_acc_30:.4f} ***")
print(f"Training time: {train_time:.2f}s | Total time: {total_time:.2f}s")

Preparing data...
Data preparation completed in 0.16 seconds
Train shape: (804, 30, 300), Val shape: (806, 30, 300)

Model architecture:



Training model on GPU...
Epoch 1/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 6s 498ms/step - accuracy: 0.6150 - loss: 0.6401 - val_accuracy: 0.8573 - val_loss: 0.4222 - learning_rate: 0.0010
Epoch 2/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step - accuracy: 0.8631 - loss: 0.4070 - val_accuracy: 0.8859 - val_loss: 0.3040 - learning_rate: 0.0010
Epoch 3/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step - accuracy: 0.8901 - loss: 0.3031 - val_accuracy: 0.9007 - val_loss: 0.2662 - learning_rate: 0.0010
Epoch 4/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step - accuracy: 0.8908 - loss: 0.2825 - val_accuracy: 0.8983 - val_loss: 0.2641 - learning_rate: 0.0010
Epoch 5/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step - accuracy: 0.9088 - loss: 0.2333 - val_accuracy: 0.8995 - val_loss: 0.2727 - learning_rate: 0.0010
Restoring model weights from the end of the best epoch: 3.

***

### maxlen = 50
Just below the 95th percentile (51.55 tokens): 94.7% of sentences fit without truncation.

In [37]:
# maxlen = 50 (around 95th percentile)
maxlen_2 = 50


start_time = time.time()
x_train, y_train, x_val, y_val = prepare_data(
    train_dataset, val_dataset,
    mimic_word_vectors, maxlen_2, embedding_dims
)

model_maxlen_50 = build_cnn_model(maxlen_2, embedding_dims)

print("Training model on GPU...")
train_start = time.time()
history_maxlen_50 = model_maxlen_50.fit(
    x_train, y_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=(x_val, y_val),
    callbacks=create_callbacks('maxlen_50'),
    verbose=1
)
train_time = time.time() - train_start

val_loss_50, val_acc_50 = model_maxlen_50.evaluate(x_val, y_val, verbose=0)
total_time = time.time() - start_time

print(f"\n*** Validation Accuracy with maxlen=50: {val_acc_50:.4f} ***")
print(f"Training time: {train_time:.2f}s | Total time: {total_time:.2f}s")

Preparing data...
Data preparation completed in 0.18 seconds
Train shape: (804, 50, 300), Val shape: (806, 50, 300)
Training model on GPU...
Epoch 1/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 6s 522ms/step - accuracy: 0.6577 - loss: 0.6154 - val_accuracy: 0.8362 - val_loss: 0.4023 - learning_rate: 0.0010
Epoch 2/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - accuracy: 0.8457 - loss: 0.4031 - val_accuracy: 0.9007 - val_loss: 0.2850 - learning_rate: 0.0010
Epoch 3/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - accuracy: 0.8770 - loss: 0.3196 - val_accuracy: 0.9020 - val_loss: 0.2645 - learning_rate: 0.0010
Epoch 4/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - accuracy: 0.8915 - loss: 0.3038 - val_accuracy: 0.9082 - val_loss: 0.2493 - learning_rate: 0.0010
Epoch 5/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - accuracy: 0.9126 - loss: 0.2171 - val_accuracy

### maxlen = 75
Well above the 95th percentile: 98.1% of sentences fit without truncation.

In [39]:
# maxlen = 75 (covers 98.1% of sentences)
maxlen_3 = 75



start_time = time.time()
x_train, y_train, x_val, y_val = prepare_data(
    train_dataset, val_dataset,
    mimic_word_vectors, maxlen_3, embedding_dims
)

model_maxlen_75 = build_cnn_model(maxlen_3, embedding_dims)

print("Training model on GPU...")
train_start = time.time()
history_maxlen_75 = model_maxlen_75.fit(
    x_train, y_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data=(x_val, y_val),
    callbacks=create_callbacks('maxlen_75'),
    verbose=1
)
train_time = time.time() - train_start

val_loss_75, val_acc_75 = model_maxlen_75.evaluate(x_val, y_val, verbose=0)
total_time = time.time() - start_time

print(f"\n*** Validation Accuracy with maxlen=75: {val_acc_75:.4f} ***")
print(f"Training time: {train_time:.2f}s | Total time: {total_time:.2f}s")

Preparing data...
Data preparation completed in 0.21 seconds
Train shape: (804, 75, 300), Val shape: (806, 75, 300)
Training model on GPU...
Epoch 1/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 6s 523ms/step - accuracy: 0.5747 - loss: 0.6406 - val_accuracy: 0.8610 - val_loss: 0.4130 - learning_rate: 0.0010
Epoch 2/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8640 - loss: 0.4035 - val_accuracy: 0.9045 - val_loss: 0.2838 - learning_rate: 0.0010
Epoch 3/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8971 - loss: 0.2903 - val_accuracy: 0.9057 - val_loss: 0.2583 - learning_rate: 0.0010
Epoch 4/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8920 - loss: 0.2795 - val_accuracy: 0.9144 - val_loss: 0.2449 - learning_rate: 0.0010
Epoch 5/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - accuracy: 0.9096 - loss: 0.2358 - val_accuracy

In [41]:
# Compare maxlen results

print(f"maxlen=30:  Validation Accuracy = {val_acc_30:.4f}")
print(f"maxlen=50:  Validation Accuracy = {val_acc_50:.4f}")
print(f"maxlen=75:  Validation Accuracy = {val_acc_75:.4f}")

# Select best maxlen
best_maxlen_results = {
    30: val_acc_30,
    50: val_acc_50,
    75: val_acc_75
}
best_maxlen = max(best_maxlen_results, key=best_maxlen_results.get)
print(f"\n*** BEST maxlen: {best_maxlen} with accuracy {best_maxlen_results[best_maxlen]:.4f} ***")


maxlen=30:  Validation Accuracy = 0.9007
maxlen=50:  Validation Accuracy = 0.9082
maxlen=75:  Validation Accuracy = 0.9144

*** BEST maxlen: 75 with accuracy 0.9144 ***
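
When reading the comparison above, it helps to keep in mind how much an accuracy measured on 806 samples can vary by chance. The sketch below uses the normal approximation to a binomial proportion, which is an assumption made here and not a method used in the notebook; the function name `accuracy_interval` is likewise hypothetical.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation interval for an accuracy measured on n samples."""
    se = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se


n_val = 806  # validation set size reported earlier in the notebook
results = {30: 0.9007, 50: 0.9082, 75: 0.9144}  # accuracies printed above

for maxlen, acc in results.items():
    low, high = accuracy_interval(acc, n_val)
    correct = round(acc * n_val)
    print(f"maxlen={maxlen}: {correct}/{n_val} correct, "
          f"approx. 95% interval [{low:.3f}, {high:.3f}]")
```

Under this approximation the three intervals overlap, and the gap between the best and worst setting is about 11 sentences, so the ranking may partly reflect random initialization and should be read with some caution.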


---
# PART 2: Experiment with Different Epoch Values

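We'll use the best maxlen from Part 1 and vary the maximum number of training epochs to balance learning against overfitting. Because `create_callbacks` uses `EarlyStopping` with `restore_best_weights=True`, each reported accuracy comes from the best epoch of that run, not necessarily the last one.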
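
Since every run in this part goes through `create_callbacks`, the epoch budget interacts with patience-based early stopping. The sketch below is a simplified model of that rule, written for illustration only; it is not Keras's implementation, and the function name `simulate_early_stopping` is an assumption.

```python
def simulate_early_stopping(val_accuracy, patience=3):
    """Return (epochs_run, best_epoch), both 1-indexed.

    Simplified patience rule: training halts once `patience` consecutive
    epochs fail to beat the best validation accuracy seen so far.
    """
    best_value = float("-inf")
    best_epoch = 0
    wait = 0
    for epoch, value in enumerate(val_accuracy, start=1):
        if value > best_value:
            best_value, best_epoch, wait = value, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_accuracy), best_epoch


# val_accuracy per epoch from the maxlen=30 run logged in Part 1.
history = [0.8573, 0.8859, 0.9007, 0.8983, 0.8995]
print(simulate_early_stopping(history, patience=3))

# Hypothetical longer run: a third non-improving epoch triggers the stop.
print(simulate_early_stopping(history + [0.8990, 0.9001], patience=3))
```

With `patience=3`, a budget of 3 or 5 epochs rarely leaves room for the stop to trigger, so the shorter runs mostly differ in how many epochs are available for finding a best checkpoint.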

In [42]:
# Use best maxlen from Part 1
maxlen_final = best_maxlen
print(f"Using maxlen = {maxlen_final} for epoch experiments")

# Prepare data once with best maxlen
x_train, y_train, x_val, y_val = prepare_data(
    train_dataset, val_dataset,
    mimic_word_vectors, maxlen_final, embedding_dims
)

Using maxlen = 75 for epoch experiments
Preparing data...
Data preparation completed in 0.19 seconds
Train shape: (804, 75, 300), Val shape: (806, 75, 300)


### epochs = 3

In [43]:
epochs_1 = 3

train_start = time.time()
model_epochs_3 = build_cnn_model(maxlen_final, embedding_dims)
history_epochs_3 = model_epochs_3.fit(
    x_train, y_train,
    batch_size=batch_size,
    epochs=epochs_1,
    validation_data=(x_val, y_val),
    callbacks=create_callbacks('epochs_3'),
    verbose=1
)
train_time = time.time() - train_start

val_loss_e3, val_acc_e3 = model_epochs_3.evaluate(x_val, y_val, verbose=0)
final_val_acc_e3 = history_epochs_3.history['val_accuracy'][-1]

print(f"\nFinal Validation Accuracy: {final_val_acc_e3:.4f}")
print(f"*** Validation Accuracy with epochs=3: {val_acc_e3:.4f} ***")
print(f"Training time: {train_time:.2f}s")

Epoch 1/3
7/7 ━━━━━━━━━━━━━━━━━━━━ 6s 521ms/step - accuracy: 0.6664 - loss: 0.6103 - val_accuracy: 0.8164 - val_loss: 0.4024 - learning_rate: 0.0010
Epoch 2/3
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - accuracy: 0.8263 - loss: 0.4357 - val_accuracy: 0.9032 - val_loss: 0.2886 - learning_rate: 0.0010
Epoch 3/3
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8806 - loss: 0.3177 - val_accuracy: 0.8958 - val_loss: 0.2805 - learning_rate: 0.0010
Restoring model weights from the end of the best epoch: 2.

Final Validation Accuracy: 0.8958
*** Validation Accuracy with epochs=3: 0.9032 ***
Training time: 6.81s


### epochs = 5

In [44]:
epochs_2 = 5



train_start = time.time()
model_epochs_5 = build_cnn_model(maxlen_final, embedding_dims)
history_epochs_5 = model_epochs_5.fit(
    x_train, y_train,
    batch_size=batch_size,
    epochs=epochs_2,
    validation_data=(x_val, y_val),
    callbacks=create_callbacks('epochs_5'),
    verbose=1
)
train_time = time.time() - train_start

val_loss_e5, val_acc_e5 = model_epochs_5.evaluate(x_val, y_val, verbose=0)
final_val_acc_e5 = history_epochs_5.history['val_accuracy'][-1]

print(f"\nFinal Validation Accuracy: {final_val_acc_e5:.4f}")
print(f"*** Validation Accuracy with epochs=5: {val_acc_e5:.4f} ***")
print(f"Training time: {train_time:.2f}s")

Epoch 1/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 6s 527ms/step - accuracy: 0.6278 - loss: 0.6223 - val_accuracy: 0.8772 - val_loss: 0.4126 - learning_rate: 0.0010
Epoch 2/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - accuracy: 0.8659 - loss: 0.4086 - val_accuracy: 0.9032 - val_loss: 0.2971 - learning_rate: 0.0010
Epoch 3/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8818 - loss: 0.3105 - val_accuracy: 0.8921 - val_loss: 0.2875 - learning_rate: 0.0010
Epoch 4/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8927 - loss: 0.2826 - val_accuracy: 0.9107 - val_loss: 0.2506 - learning_rate: 0.0010
Epoch 5/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.9134 - loss: 0.2227 - val_accuracy: 0.9032 - val_loss: 0.2575 - learning_rate: 0.0010
Restoring model weights from the end of the best epoch: 4.

Final Validation Accuracy: 0.

### epochs = 8

In [45]:
epochs_3 = 8

train_start = time.time()
model_epochs_8 = build_cnn_model(maxlen_final, embedding_dims)
history_epochs_8 = model_epochs_8.fit(
    x_train, y_train,
    batch_size=batch_size,
    epochs=epochs_3,
    validation_data=(x_val, y_val),
    callbacks=create_callbacks('epochs_8'),
    verbose=1
)
train_time = time.time() - train_start

val_loss_e8, val_acc_e8 = model_epochs_8.evaluate(x_val, y_val, verbose=0)
final_val_acc_e8 = history_epochs_8.history['val_accuracy'][-1]

print(f"\nFinal Validation Accuracy: {final_val_acc_e8:.4f}")
print(f"*** Validation Accuracy with epochs=8: {val_acc_e8:.4f} ***")
print(f"Training time: {train_time:.2f}s")

Epoch 1/8
7/7 ━━━━━━━━━━━━━━━━━━━━ 6s 525ms/step - accuracy: 0.6452 - loss: 0.6208 - val_accuracy: 0.8859 - val_loss: 0.4142 - learning_rate: 0.0010
Epoch 2/8
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8783 - loss: 0.4019 - val_accuracy: 0.8797 - val_loss: 0.3068 - learning_rate: 0.0010
Epoch 3/8
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8829 - loss: 0.3264 - val_accuracy: 0.9082 - val_loss: 0.2635 - learning_rate: 0.0010
Epoch 4/8
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - accuracy: 0.8993 - loss: 0.2653 - val_accuracy: 0.9094 - val_loss: 0.2498 - learning_rate: 0.0010
Epoch 5/8
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.9137 - loss: 0.2532 - val_accuracy: 0.8945 - val_loss: 0.2724 - learning_rate: 0.0010
Epoch 6/8
1/7 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step - accur

### epochs = 10 (checking for overfitting)


In [46]:
epochs_4 = 10

train_start = time.time()
model_epochs_10 = build_cnn_model(maxlen_final, embedding_dims)
history_epochs_10 = model_epochs_10.fit(
    x_train, y_train,
    batch_size=batch_size,
    epochs=epochs_4,
    validation_data=(x_val, y_val),
    callbacks=create_callbacks('epochs_10'),
    verbose=1
)
train_time = time.time() - train_start

val_loss_e10, val_acc_e10 = model_epochs_10.evaluate(x_val, y_val, verbose=0)
final_val_acc_e10 = history_epochs_10.history['val_accuracy'][-1]

print(f"\nFinal Validation Accuracy: {final_val_acc_e10:.4f}")
print(f"*** Validation Accuracy with epochs=10: {val_acc_e10:.4f} ***")
print(f"Training time: {train_time:.2f}s")

Epoch 1/10
7/7 ━━━━━━━━━━━━━━━━━━━━ 6s 528ms/step - accuracy: 0.5763 - loss: 0.6368 - val_accuracy: 0.8722 - val_loss: 0.4221 - learning_rate: 0.0010
Epoch 2/10
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - accuracy: 0.8462 - loss: 0.4214 - val_accuracy: 0.8933 - val_loss: 0.3052 - learning_rate: 0.0010
Epoch 3/10
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step - accuracy: 0.8828 - loss: 0.3147 - val_accuracy: 0.9045 - val_loss: 0.2599 - learning_rate: 0.0010
Epoch 4/10
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8999 - loss: 0.2733 - val_accuracy: 0.8871 - val_loss: 0.2972 - learning_rate: 0.0010
Epoch 5/10
1/7 ━━━━━━━━━━━━━━━━━━━━ 0s 28ms/step - accuracy: 0.8594 - loss: 0.3450
Epoch 5: ReduceLROnPlateau reducing learning rate to 0.0005000000237487257.
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step -

In [47]:
# Compare epoch results
print("\n" + "="*60)
print("PART 2 RESULTS: Epochs Comparison")
print("="*60)
print(f"epochs=3:  Validation Accuracy = {val_acc_e3:.4f}")
print(f"epochs=5:  Validation Accuracy = {val_acc_e5:.4f}")
print(f"epochs=8:  Validation Accuracy = {val_acc_e8:.4f}")
print(f"epochs=10: Validation Accuracy = {val_acc_e10:.4f}")

# Select best epochs based on validation accuracy
best_epochs_results = {
    3: val_acc_e3,
    5: val_acc_e5,
    8: val_acc_e8,
    10: val_acc_e10
}
best_epochs = max(best_epochs_results, key=best_epochs_results.get)
print(f"\n*** BEST epochs: {best_epochs} with validation accuracy {best_epochs_results[best_epochs]:.4f} ***")
print("="*60)


PART 2 RESULTS: Epochs Comparison
epochs=3:  Validation Accuracy = 0.9032
epochs=5:  Validation Accuracy = 0.9107
epochs=8:  Validation Accuracy = 0.9156
epochs=10: Validation Accuracy = 0.9045

*** BEST epochs: 8 with validation accuracy 0.9156 ***


---
# PART 4: Experiment with Different Word Embeddings

We'll compare:
1. **Custom MIMIC Word2Vec** (domain-specific, trained on clinical text) - already tested above
2. **GloVe embeddings** (general-purpose; the `glove.6B` vectors used below were pre-trained on Wikipedia 2014 and Gigaword 5)

This comparison tests whether domain-specific embeddings outperform general-purpose embeddings for clinical NLP tasks.
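
One factor behind such a comparison is vocabulary coverage, because `tokenize_and_vectorize` silently skips tokens that have no vector. The sketch below measures the fraction of token occurrences covered by a vocabulary. The function name `token_coverage` and the toy vocabularies are assumptions for illustration; anything that supports `in`, including the notebook's word-vector objects, could be passed as `vocab`.

```python
def token_coverage(tokenized_texts, vocab):
    """Fraction of token occurrences that have an embedding in vocab."""
    total = sum(len(sent) for sent in tokenized_texts)
    if total == 0:
        return 0.0
    known = sum(1 for sent in tokenized_texts for tok in sent if tok in vocab)
    return known / total


# Hypothetical vocabularies standing in for the two embedding sets.
domain_vocab = {"patient", "ivdu", "heroin", "use", "denies"}
general_vocab = {"patient", "heroin", "use", "denies", "the"}

texts = [["patient", "denies", "ivdu"], ["heroin", "use"]]

print(f"domain:  {token_coverage(texts, domain_vocab):.2f}")
print(f"general: {token_coverage(texts, general_vocab):.2f}")
```

Comparing the two coverage values on the real data would show whether clinical abbreviations are missing from the general-purpose vectors.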

## 4.1: Load GloVe Embeddings

**Download GloVe vectors from**: https://nlp.stanford.edu/projects/glove/
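
GloVe text files store one token per line followed by its vector components, separated by spaces. A minimal sketch of parsing a single line is shown below. Splitting from the right is a defensive choice, on the assumption that some larger GloVe releases contain tokens with embedded spaces; this is not expected to matter for `glove.6B`. The function name `parse_glove_line` and the 3-dimensional sample lines are hypothetical.

```python
def parse_glove_line(line, dim):
    """Split one line of a GloVe text file into (token, vector).

    Splitting from the right keeps tokens that themselves contain
    spaces intact, which a plain line.split() would break apart.
    """
    parts = line.rstrip("\n").rsplit(" ", dim)
    if len(parts) != dim + 1:
        raise ValueError(f"expected {dim} values, got {len(parts) - 1}")
    token, values = parts[0], parts[1:]
    return token, [float(v) for v in values]


# Hypothetical 3-dimensional lines in the same text layout.
print(parse_glove_line("aspirin 0.1 -0.2 0.3\n", dim=3))
print(parse_glove_line(". . . 0.5 0.5 0.5\n", dim=3))
```

Passing the expected dimension also catches truncated lines early, since a line with too few values raises a `ValueError`.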



In [22]:
# Download GloVe vectors
!wget https://nlp.stanford.edu/data/glove.6B.zip -O /tmp/glove.6B.zip
!unzip /tmp/glove.6B.zip -d /tmp/

--2025-11-04 04:24:00--  https://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2025-11-04 04:24:01--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘/tmp/glove.6B.zip’


2025-11-04 04:26:45 (5.04 MB/s) - ‘/tmp/glove.6B.zip’ saved [862182613/862182613]

Archive:  /tmp/glove.6B.zip
  inflating: /tmp/glove.6B.50d.txt   
  inflating: /tmp/glove.6B.100d.txt  
  inflating: /tmp/glove.6B.200d.txt  
  inflating: /tmp/glove.6B.300d.txt  

In [48]:
def load_glove_vectors(glove_file, limit=None):
    """
    Load GloVe vectors into Gensim KeyedVectors format.
    Reads the text file line by line; pass `limit` to keep only the first N vectors.
    """
    print(f"Loading GloVe vectors from {glove_file}...")
    start_time = time.time()

    embeddings = {}
    count = 0

    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.array(values[1:], dtype='float32')
            embeddings[word] = vector

            count += 1
            if limit and count >= limit:
                break

            if count % 100000 == 0:
                print(f"  Loaded {count:,} vectors...")

    # Convert to KeyedVectors format
    vector_size = len(next(iter(embeddings.values())))
    kv = KeyedVectors(vector_size=vector_size)
    kv.add_vectors(list(embeddings.keys()), list(embeddings.values()))

    load_time = time.time() - start_time
    print(f"\nLoaded {len(kv):,} GloVe vectors ({vector_size} dimensions) in {load_time:.2f} seconds")
    return kv

In [25]:
# Load GloVe vectors
# Update this path to where you have GloVe vectors downloaded
glove_path = '/tmp/glove.6B.300d.txt'

# Load all GloVe vectors (or set limit=200000)
glove_word_vectors = load_glove_vectors(glove_path)

Loading GloVe vectors from /tmp/glove.6B.300d.txt...
  Loaded 100,000 vectors...
  Loaded 200,000 vectors...
  Loaded 300,000 vectors...
  Loaded 400,000 vectors...

Loaded 400,000 GloVe vectors (300 dimensions) in 25.25 seconds


## 4.2: Train and Evaluate with GloVe Embeddings

In [49]:

print(f"Using best configuration: maxlen={maxlen_final}, epochs={best_epochs}")

start_time = time.time()

# Prepare data with GloVe vectors
x_train_glove, y_train_glove, x_val_glove, y_val_glove = prepare_data(
    train_dataset, val_dataset,
    glove_word_vectors, maxlen_final, embedding_dims
)

# Train model
print("\nTraining model on GPU with GloVe embeddings...")
train_start = time.time()
model_glove = build_cnn_model(maxlen_final, embedding_dims)
history_glove = model_glove.fit(
    x_train_glove, y_train_glove,
    batch_size=batch_size,
    epochs=best_epochs,
    validation_data=(x_val_glove, y_val_glove),
    callbacks=create_callbacks('glove'),
    verbose=1
)
train_time = time.time() - train_start

# Evaluate the trained model on the validation set; also keep the last-epoch
# value from the history, which can differ if a callback restored earlier weights
val_loss_glove, val_acc_glove = model_glove.evaluate(x_val_glove, y_val_glove, verbose=0)
final_val_acc_glove = history_glove.history['val_accuracy'][-1]
total_time = time.time() - start_time

print(f"\nLast-epoch Validation Accuracy (history): {final_val_acc_glove:.4f}")
print(f"*** Validation Accuracy with GloVe (evaluate): {val_acc_glove:.4f} ***")
print(f"Training time: {train_time:.2f}s | Total time: {total_time:.2f}s")

Using best configuration: maxlen=75, epochs=8
Preparing data...
Data preparation completed in 0.17 seconds
Train shape: (804, 75, 300), Val shape: (806, 75, 300)

Training model on GPU with GloVe embeddings...
Epoch 1/8
7/7 ━━━━━━━━━━━━━━━━━━━━ 6s 526ms/step - accuracy: 0.5592 - loss: 0.7261 - val_accuracy: 0.8102 - val_loss: 0.4351 - learning_rate: 0.0010
Epoch 2/8
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step - accuracy: 0.8396 - loss: 0.3932 - val_accuracy: 0.8871 - val_loss: 0.3248 - learning_rate: 0.0010
Epoch 3/8
7/7 ━━━━━━━━━━━━━━━━━━━━ 1s 88ms/step - accuracy: 0.9104 - loss: 0.2496 - val_accuracy: 0.8921 - val_loss: 0.3015 - learning_rate: 0.0010
Epoch 4/8
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 22ms/step - accuracy: 0.9405 - loss: 0.1973 - val_accuracy: 0.9007 - val_loss: 0.2621 - learning_rate: 0.0010
Epoch 5/8
7/7 ━━━━━━━━━━━━━━━━━━━━

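The cell above records two validation accuracies: the last entry of `history_glove.history['val_accuracy']` and the value returned by `model_glove.evaluate`. These can differ when a callback restores earlier weights; whether `create_callbacks` does so is not shown in this section. As a small sketch, the snippet below uses only the four epochs visible in the log to show how the best epoch can be read from a history dictionary. `history_dict` is a hand-built stand-in, not the real `history_glove.history`, and `best_epoch` is a hypothetical helper.

```python
# Hand-copied from the first four epochs of the training log (later epochs are not shown).
history_dict = {
    "val_accuracy": [0.8102, 0.8871, 0.8921, 0.9007],
    "val_loss": [0.4351, 0.3248, 0.3015, 0.2621],
}


def best_epoch(history, metric="val_accuracy", higher_is_better=True):
    """Return (1-based epoch, value) of the best recorded value for `metric`."""
    values = history[metric]
    pick = max if higher_is_better else min
    index = pick(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


epoch_acc, acc = best_epoch(history_dict)
epoch_loss, loss = best_epoch(history_dict, metric="val_loss", higher_is_better=False)
last_acc = history_dict["val_accuracy"][-1]

print(f"Best val_accuracy {acc:.4f} at epoch {epoch_acc}; last recorded {last_acc:.4f}")
print(f"Lowest val_loss   {loss:.4f} at epoch {epoch_loss}")
```
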
In [51]:
# Compare all embedding results

# Get MIMIC Word2Vec result from best epochs experiment
val_acc_mimic = best_epochs_results[best_epochs]

print(f"Custom MIMIC Word2Vec: Validation Accuracy = {val_acc_mimic:.4f}")
print(f"GloVe Embeddings:      Validation Accuracy = {val_acc_glove:.4f}")

# Select best embedding
embedding_results = {
    'Custom MIMIC Word2Vec': val_acc_mimic,
    'GloVe': val_acc_glove
}
best_embedding = max(embedding_results, key=embedding_results.get)
# Absolute gap in accuracy; multiplied by 100 below, it is in percentage points
improvement = abs(val_acc_mimic - val_acc_glove)

print(f"\n*** BEST embedding: {best_embedding} ***")
print(f"*** Best accuracy: {embedding_results[best_embedding]:.4f} ***")
print(f"*** Difference: {improvement:.4f} ({improvement*100:.2f}%) ***")


Custom MIMIC Word2Vec: Validation Accuracy = 0.9156
GloVe Embeddings:      Validation Accuracy = 0.9206

*** BEST embedding: GloVe ***
*** Best accuracy: 0.9206 ***
*** Difference: 0.0050 (0.50%) ***
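
To put the 0.0050 gap in context: on the 806 validation sentences it corresponds to roughly four sentences. The sketch below is not part of the original notebook. It computes that count and a normal-approximation 95% interval for each accuracy. `accuracy_interval` is a hypothetical helper, and the interval assumes independent sentences, which may not hold for clinical notes.

```python
import math


def accuracy_interval(acc, n, z=1.96):
    """Normal-approximation (Wald) interval for an accuracy measured on n samples."""
    half_width = z * math.sqrt(acc * (1.0 - acc) / n)
    return acc - half_width, acc + half_width


n_val = 806                           # validation set size reported above
acc_mimic, acc_glove = 0.9156, 0.9206  # accuracies printed by the comparison cell

# How many validation sentences the accuracy gap represents
gap_in_sentences = round((acc_glove - acc_mimic) * n_val)

lo_m, hi_m = accuracy_interval(acc_mimic, n_val)
lo_g, hi_g = accuracy_interval(acc_glove, n_val)
intervals_overlap = lo_g <= hi_m and lo_m <= hi_g

print(f"Gap of {acc_glove - acc_mimic:.4f} is about {gap_in_sentences} sentences")
print(f"MIMIC Word2Vec 95% interval: [{lo_m:.4f}, {hi_m:.4f}]")
print(f"GloVe          95% interval: [{lo_g:.4f}, {hi_g:.4f}]")
print(f"Intervals overlap: {intervals_overlap}")
```

Under this rough approximation the two intervals overlap, so the ranking of the two embeddings on this split is best read as tentative.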


---
# Final Optimized Model Summary

In [53]:

# Hardware line is hard-coded; it describes the runtime this notebook was run on
print("\nHardware: Google Cloud A100 GPU")
print("  Mixed Precision Training: Enabled (float16/float32)")
print("  XLA Compilation: Enabled")
print("\nDataset: MIMIC-Ext-DrugDetection")
print(f"  Train: {len(train_dataset)} samples")
# val.csv serves as the held-out set because test.csv has no labels
print(f"  Val/Test: {len(val_dataset)} samples")
print("\nOptimal Hyperparameters:")
print(f"  maxlen:     {maxlen_final}")
print(f"  epochs:     {best_epochs}")
print(f"  embeddings: {best_embedding}")
print(f"  batch_size: {batch_size} (GPU-optimized)")
# filters and kernel_size are hard-coded here; keep them in sync with build_cnn_model
print("  filters:    250")
print("  kernel_size: 3")
print("\nBest Results:")
print(f"  Validation Accuracy: {max(embedding_results.values()):.4f}")



Hardware: Google Cloud A100 GPU
  Mixed Precision Training: Enabled (float16/float32)
  XLA Compilation: Enabled

Dataset: MIMIC-Ext-DrugDetection
  Train: 804 samples
  Val/Test: 806 samples

Optimal Hyperparameters:
  maxlen:     75
  epochs:     8
  embeddings: GloVe
  batch_size: 128 (GPU-optimized)
  filters:    250
  kernel_size: 3

Best Results:
  Validation Accuracy: 0.9206


## Test the Final Model on Sample Sentences

In [54]:
def predict_drug_mention(text, model, word_vectors, maxlen, embedding_dims):
    """
    Predict whether a sentence contains a drug mention.
    Returns: probability score (0-1)
    """
    sample_dataset = [(0, text)]
    vectorized = tokenize_and_vectorize(sample_dataset, word_vectors)
    padded = pad_trunc(vectorized, maxlen, embedding_dims)
    x = np.array(padded, dtype=np.float32).reshape(1, maxlen, embedding_dims)
    prediction = model.predict(x, verbose=0)[0][0]
    return prediction


# Select the best model and vectors
if best_embedding == 'Custom MIMIC Word2Vec':
    if best_epochs == 3:
        final_model = model_epochs_3
    elif best_epochs == 5:
        final_model = model_epochs_5
    elif best_epochs == 8:
        final_model = model_epochs_8
    else:
        final_model = model_epochs_10
    final_vectors = mimic_word_vectors
else:  # GloVe
    final_model = model_glove
    final_vectors = glove_word_vectors

# Test on sample sentences
test_sentences = [
    "#Moderate normocytic anemia, stable #Diffuse marrow low signal in the spine #Splenomegaly #Elevated light chains in the serum, but with normal SPEP and UPEP -Suspect acutesuspect inflammatory block on his marrow from IVDU and infection -recommend follow-up with Hematology as an outpatient in ___ weeks",
    "Substance abuse/withdrawal - The patient had recently used both cocaine and heroin prior to admission.",
    "The patient's nephew agreed and felt that patient should be clinically cleared as much as possible and then allowed to sit up to prevent aspiration..",
    "Pt is a ___ y.o male with h.o ETOH/opiate abuse, depression, HTN who presented with SI and ETOH intoxication/withdrawal.",
    "The patient underwent rectal swabs for VRE preoperatively which was negative.",
    "Status post splenectomy.",
    "Findings were discussed with Dr. [ * * Last Name ( STitle ) * * ] at the time of dictation CULTURE DATA: [ * * <<DATE>> * * ]",
    "Per report, he took ten 1 mg pills of Xanax at 4pm on the day prior to admission and 2 1mg Xanax tablets again on the day of admission, 3 hours prior to presentation.",
    "Alert, oriented x 3.",
    "Past medical history: IVDA Anxiety Depression"
]

# Expected labels (has_drug) for the sentences above, in order:
# True, True, False, True, False, False, False, True, False, True

for sent in test_sentences:
    pred = predict_drug_mention(sent, final_model, final_vectors, maxlen_final, embedding_dims)
    label = "DRUG MENTION" if pred > 0.5 else "NO DRUG"
    print(f"{label:12s} (score: {pred:.4f})  |  {sent}")
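
The expected labels in the comment above can also be checked against the scores programmatically. This is a minimal sketch, assuming the same 0.5 threshold. `compare_to_expected` is a hypothetical helper, and the scores listed are only the first six printed by this cell, paired with the first six expected labels.

```python
def compare_to_expected(scores, expected, threshold=0.5):
    """Return (index, score, predicted, expected) rows where the label disagrees."""
    mismatches = []
    for index, (score, truth) in enumerate(zip(scores, expected)):
        predicted = score > threshold
        if predicted != truth:
            mismatches.append((index, score, predicted, truth))
    return mismatches


# First six scores from the cell output and the matching expected labels
scores = [0.4178, 0.9994, 0.1345, 0.9963, 0.1090, 0.0120]
expected = [True, True, False, True, False, False]

mismatches = compare_to_expected(scores, expected)
print(f"{len(scores) - len(mismatches)}/{len(scores)} agree")
for index, score, predicted, truth in mismatches:
    print(f"sentence {index}: score={score:.4f}, predicted={predicted}, expected={truth}")
```

On these six sentences the only disagreement is the first one, which mentions "IVDU" and scores 0.4178, just under the threshold.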


NO DRUG      (score: 0.4178)  |  #Moderate normocytic anemia, stable #Diffuse marrow low signal in the spine #Splenomegaly #Elevated light chains in the serum, but with normal SPEP and UPEP -Suspect acutesuspect inflammatory block on his marrow from IVDU and infection -recommend follow-up with Hematology as an outpatient in ___ weeks
DRUG MENTION (score: 0.9994)  |  Substance abuse/withdrawal - The patient had recently used both cocaine and heroin prior to admission.
NO DRUG      (score: 0.1345)  |  The patient's nephew agreed and felt that patient should be clinically cleared as much as possible and then allowed to sit up to prevent aspiration..
DRUG MENTION (score: 0.9963)  |  Pt is a ___ y.o male with h.o ETOH/opiate abuse, depression, HTN who presented with SI and ETOH intoxication/withdrawal.
NO DRUG      (score: 0.1090)  |  The patient underwent rectal swabs for VRE preoperatively which was negative.
NO DRUG      (score: 0.0120)  |  Status post splenectomy.
NO DRUG      (score: 0