# Seismic Interpretation Baseline Analysis
## Jupyter Notebook: Baseline Models

This Jupyter Notebook sets up and runs initial baseline models for seismic facies classification. A simple
Convolutional Neural Network (CNN) and an adapted pre-trained RoBERTa model (TBD; a more advanced model might be used)
serve as benchmarks. The notebook demonstrates training loops, evaluation metrics, and plotting of the training history.

However, this document excludes the combination of the CNN with the pre-trained model, the optimization of the methodology to obtain the best seismic stratigraphy model, and the reinforcement learning that will be used to train for different geologic regions and strata.

In [12]:
!pip install segyio scipy
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import segyio
from scipy.signal import butter, filtfilt, hilbert

# reproducibility
np.random.seed(42)
torch.manual_seed(42)

segy_path = "Seismic_data.sgy"   # raw seismic cube
tops_path = "f3_dataset.sgy"     # horizon (tops) picks

# CNN history
#cnn_train_loss, cnn_val_loss = [], []
#cnn_train_acc,  cnn_val_acc  = [], []

# RoBERTa history
#roberta_train_loss, roberta_val_loss = [], []
#roberta_train_acc,  roberta_val_acc  = [], []

# 1) Load raw seismic cube to be analyzed
with segyio.open(segy_path, "r", ignore_geometry=True) as f:
    f.mmap()
    try:
        volume = segyio.tools.cube(f)  # shape (n_ilines, n_xlines, n_samples)
    except Exception:
        # fallback: build cube from traces + headers
        n_traces  = f.tracecount
        n_samples = len(f.samples)
        raw       = np.stack([f.trace.raw[i] for i in range(n_traces)], axis=0)
        inlines   = f.attributes(segyio.TraceField.INLINE_3D)[:]
        xlines    = f.attributes(segyio.TraceField.CROSSLINE_3D)[:]
        uni_il    = np.unique(inlines)
        uni_xl    = np.unique(xlines)
        n_il, n_xl= len(uni_il), len(uni_xl)
        volume    = np.zeros((n_il, n_xl, n_samples), dtype=raw.dtype)
        il_map    = {val: idx for idx, val in enumerate(uni_il)}
        xl_map    = {val: idx for idx, val in enumerate(uni_xl)}
        for t in range(n_traces):
            iidx = il_map[inlines[t]]
            xidx = xl_map[xlines[t]]
            volume[iidx, xidx, :] = raw[t]

# 2) Band‑pass + envelope extraction
def bandpass_filter(trace, lowcut=5, highcut=60, fs=250, order=4):
    nyq = 0.5 * fs
    b, a = butter(order, [lowcut/nyq, highcut/nyq], btype="band")
    return filtfilt(b, a, trace)

n_ilines, n_xlines, n_samples = volume.shape
vol_bp = np.zeros_like(volume, dtype=np.float32)

for il in range(n_ilines):
    for xl in range(n_xlines):
        tr    = volume[il, xl, :].astype(np.float32)
        filt  = bandpass_filter(tr, lowcut=5, highcut=60, fs=250)
        env   = np.abs(hilbert(filt))
        vol_bp[il, xl, :] = env

# 3) Outlier clipping + normalization
p1, p99  = np.percentile(vol_bp, [1, 99])
vol_bp   = np.clip(vol_bp, p1, p99)
mean, std= vol_bp.mean(), vol_bp.std()
vol_norm = (vol_bp - mean) / (std + 1e-8)

print(f"Preprocessed volume shape={vol_norm.shape}, mean={vol_norm.mean():.3f}, std={vol_norm.std():.3f}")

Preprocessed volume shape=(651, 951, 462), mean=0.000, std=1.000


This cell implements a complete preprocessing pipeline that converts a raw SEG‑Y volume into a normalized feature cube suitable for patch extraction and CNN training. First, the required libraries are installed and imported: NumPy and PyTorch for array and tensor operations, segyio for memory‑mapped reading of seismic files, and SciPy’s signal module for band‑pass filtering and envelope extraction. Random seeds in NumPy and PyTorch are fixed so that stochastic routines are reproducible. The primary SEG‑Y file is then opened in read mode with segyio and memory‑mapped, and `segyio.tools.cube` is asked to assemble a three‑dimensional array of shape `(n_inlines, n_crosslines, n_samples)`. If that call fails, which is likely here because `ignore_geometry=True` tells segyio not to infer the inline/crossline layout, the fallback reads every raw trace, takes its inline and crossline numbers from the trace headers, and reconstructs the full volume manually via dictionary lookups.
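The header-based fallback described above can also be written without a Python loop. The following is a minimal sketch on synthetic arrays, not the notebook's own code: `rebuild_cube` is a hypothetical helper (it is not part of segyio), and the inline/crossline numbers are made up for illustration.

```python
import numpy as np

def rebuild_cube(raw, inlines, xlines):
    """Place traces into an (n_il, n_xl, n_samples) cube using header values.

    Hypothetical helper for illustration; positions with no trace stay zero.
    """
    uni_il, il_idx = np.unique(inlines, return_inverse=True)
    uni_xl, xl_idx = np.unique(xlines, return_inverse=True)
    cube = np.zeros((len(uni_il), len(uni_xl), raw.shape[1]), dtype=raw.dtype)
    cube[il_idx, xl_idx, :] = raw   # scatter each trace to its (inline, crossline) cell
    return cube, uni_il, uni_xl

# Synthetic example: 2 inlines x 3 crosslines, 4 samples per trace, shuffled order
inlines = np.array([101, 100, 100, 101, 100, 101])
xlines  = np.array([ 12,  10,  12,  10,  11,  11])
raw     = np.arange(6 * 4, dtype=np.float32).reshape(6, 4)

cube, uni_il, uni_xl = rebuild_cube(raw, inlines, xlines)
print(cube.shape)  # (2, 3, 4)
```

`np.unique(..., return_inverse=True)` plays the role of the `il_map` / `xl_map` dictionaries: it returns, for every trace, the index of its inline or crossline among the sorted unique values.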

Once the raw 3D volume is obtained, every trace (one per inline/crossline position) is passed through a fourth‑order
Butterworth band‑pass filter between 5 Hz and 60 Hz, with `fs=250` (a 250 Hz sampling rate, i.e. a 4 ms sample
interval). The filter is applied forward and backward via `filtfilt`, which gives zero‑phase filtering and so avoids
phase distortion. The analytic signal of the filtered trace is computed using the Hilbert transform, and its absolute
value (the envelope) is extracted to emphasize reflector strength, a key stratigraphic attribute. These envelopes
populate the temporary cube `vol_bp`. To mitigate the influence of acquisition spikes and extreme outliers, the 1st and
99th percentiles of all envelope values are computed and any values outside this range are clipped. Finally, a global
standardization is applied by subtracting the mean and dividing by the standard deviation (with a small epsilon added
to avoid division by zero), producing `vol_norm`, which exhibits zero mean and unit variance. A print statement then
confirms the preprocessed volume’s dimensions and verifies that its mean and standard deviation are approximately 0 and
1, respectively, indicating readiness for the downstream analysis.
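Because `filtfilt` and `hilbert` both accept an `axis` argument, the same filtering and envelope steps can be applied to the whole cube at once instead of trace by trace. The sketch below illustrates this on a small random array standing in for real data; `envelope_cube` and `clip_and_standardize` are hypothetical helper names, and the filter settings simply mirror the defaults used in the cell above.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def envelope_cube(volume, lowcut=5.0, highcut=60.0, fs=250.0, order=4):
    """Band-pass every trace along the last axis and return its envelope."""
    nyq = 0.5 * fs
    b, a = butter(order, [lowcut / nyq, highcut / nyq], btype="band")
    filtered = filtfilt(b, a, volume, axis=-1)      # zero-phase filtering
    return np.abs(hilbert(filtered, axis=-1)).astype(np.float32)

def clip_and_standardize(x, lo=1, hi=99, eps=1e-8):
    """Clip to the [lo, hi] percentiles, then scale to zero mean / unit variance."""
    p_lo, p_hi = np.percentile(x, [lo, hi])
    x = np.clip(x, p_lo, p_hi)
    return (x - x.mean()) / (x.std() + eps)

# Synthetic stand-in for a small cube: 3 x 4 traces, 462 samples each
rng = np.random.default_rng(42)
toy = rng.standard_normal((3, 4, 462)).astype(np.float32)

env = envelope_cube(toy)
norm = clip_and_standardize(env)
print(norm.shape, round(float(norm.mean()), 3), round(float(norm.std()), 3))
```

On a full-size cube the vectorized call needs enough memory to hold the filtered intermediate, so the per-trace loop may still be preferable on a small machine.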

### Data Loading and Preparation

In [13]:
# Code from segyio paper to load tops
with segyio.open(tops_path, "r", ignore_geometry=True) as f2:
    f2.mmap()
    try:
        tops_vol = segyio.tools.cube(f2)  # shape might be (n_il, n_xl, 1) or (n_il, n_xl, n_samples)
    except Exception:
        # Fallback: stack raw traces and reshape based on inline/crossline headers
        n_traces = f2.tracecount
        n_samples = len(f2.samples)
        raw = np.stack([f2.trace.raw[i] for i in range(n_traces)], axis=0)
        inlines = f2.attributes(segyio.TraceField.INLINE_3D)[:]
        xlines = f2.attributes(segyio.TraceField.CROSSLINE_3D)[:]
        uni_il = np.unique(inlines)
        uni_xl = np.unique(xlines)
        il_map = {v: i for i, v in enumerate(uni_il)}
        xl_map = {v: i for i, v in enumerate(uni_xl)}
        tops_vol = np.zeros((len(uni_il), len(uni_xl), n_samples), dtype=raw.dtype)
        for t in range(n_traces):
            i = il_map[inlines[t]]
            j = xl_map[xlines[t]]
            tops_vol[i, j, :] = raw[t]

# Now reduce to a 2D pick index array:
if tops_vol.ndim == 3 and tops_vol.shape[2] == 1:
    tops = tops_vol[:, :, 0]
elif tops_vol.ndim == 3:
    # e.g. picks stored as amplitude across samples—take the mean and round
    tops = np.round(tops_vol.mean(axis=2)).astype(int)
else:
    raise ValueError(f"Unexpected tops_vol shape: {tops_vol.shape}")

print("Loaded tops shape:", tops.shape)

Loaded tops shape: (651, 951)
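The 2D `tops` array holds one horizon pick per inline/crossline position, and the next cell labels a patch by comparing a sample index against that pick. The per-sample version of that comparison can be sketched with NumPy broadcasting. This assumes, as the labeling code implies, that picks are stored as sample indices; `below_horizon_mask` and the toy picks are hypothetical.

```python
import numpy as np

def below_horizon_mask(tops, n_samples):
    """Boolean (n_il, n_xl, n_samples) array: True where sample index > pick."""
    sample_idx = np.arange(n_samples)               # shape (n_samples,)
    return sample_idx[None, None, :] > tops[:, :, None]

# Toy 2 x 2 horizon with picks given as sample indices
tops_toy = np.array([[1, 3],
                     [0, 2]])
mask = below_horizon_mask(tops_toy, n_samples=5)
print(mask[0, 0].tolist())  # [False, False, True, True, True]
```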


In [14]:
# parameters
patch_size = 32
stride     = 32
image_size  = patch_size   # for the CNN flatten
num_classes = 2            # binary facies to be changed later for multi-class context
n_il, n_xl, n_sm = volume.shape

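# normalize the raw volume to zero mean / unit variance
# NOTE: this overwrites the envelope-based `vol_norm` from the preprocessing cell,
# so the patches below are cut from raw amplitudes. Delete these three lines to
# train on the band-passed envelope cube instead.
vol_mean = volume.mean()
vol_std  = volume.std()
vol_norm = (volume - vol_mean) / (vol_std + 1e-8)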

patches = []
labels  = []

for il in range(n_il):
    for xl in range(0, n_xl - patch_size + 1, stride):
        for sm in range(0, n_sm - patch_size + 1, stride):
            patch = vol_norm[il, xl:xl+patch_size, sm:sm+patch_size]
            patches.append(patch)
            # label = 1 if the patch's center sample index lies below the horizon pick
            # (the pick is read at the patch's first crossline, xl)
            top_idx      = tops[il, xl]
            center_depth = sm + patch_size//2
            labels.append(int(center_depth > top_idx))

# stack & reshape for PyTorch (N, C=1, H, W)
X = np.stack(patches)[:, None, :, :].astype(np.float32)
y = np.array(labels, dtype=np.int64)

# detect which classes are present and set num_classes, the number of stratigraphic classes (bed types)
unique_classes = np.unique(y)
num_classes    = len(unique_classes)
print(f"Detected classes: {unique_classes}  →  num_classes = {num_classes}")

# sequential 80/20 train/val split (no shuffling, so validation patches come from the last inlines) and DataLoaders
split = int(0.8 * len(y))
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

train_ds = TensorDataset(torch.tensor(X_train), torch.tensor(y_train))
val_ds   = TensorDataset(torch.tensor(X_val),   torch.tensor(y_val))
train_loader = DataLoader(train_ds, batch_size=16, shuffle=True)
val_loader   = DataLoader(val_ds,   batch_size=16, shuffle=False)

print("Prepared", len(train_ds), "train and", len(val_ds), "val patches.")

Detected classes: [0 1]  →  num_classes = 2
Prepared 211444 train and 52862 val patches.
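The patch counts printed above follow directly from the cube shape, the patch size, and the stride. The short check below reproduces that arithmetic in plain Python; `n_windows` is a hypothetical helper, and the shape values are copied from the preprocessing output.

```python
def n_windows(length, patch_size, stride):
    """Number of start positions produced by range(0, length - patch_size + 1, stride)."""
    return len(range(0, length - patch_size + 1, stride))

n_il, n_xl, n_sm = 651, 951, 462        # cube shape printed by the preprocessing cell
patch_size = stride = 32

per_inline = n_windows(n_xl, patch_size, stride) * n_windows(n_sm, patch_size, stride)
total = n_il * per_inline
n_train = int(0.8 * total)

print(per_inline, total, n_train, total - n_train)
# 406 264306 211444 52862
```

With non-overlapping 32-sample windows, the last 23 crosslines and the last 14 samples of each inline do not fall inside any patch.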


### Simple CNN Baseline Model

In [15]:
# Define a simple CNN (based on previous work with CNNs)
class SeismicCNN(nn.Module):
    def __init__(self, in_channels=1, num_classes=num_classes, patch_size=32):
        super().__init__()
        # Feature extractor: two conv→BN→ReLU→pool blocks
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),  # 32×32→32×32
            nn.BatchNorm2d(16),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                       # →16×16

            nn.Conv2d(16, 32, kernel_size=3, padding=1),           # 16×16→16×16
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                                       # →8×8
        )
        # Compute flattened feature size after two pools
        feat_dim = patch_size // 4  # 32→16→8
        flat_size = 32 * feat_dim * feat_dim

        # Classifier head
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat_size, 64),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.classifier(x)
        return x

# init
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cnn_model = SeismicCNN().to(device)
criterion   = nn.CrossEntropyLoss()
optimizer_cnn = torch.optim.Adam(cnn_model.parameters(), lr=1e-3)

print(cnn_model)

SeismicCNN(
  (features): Sequential(
    (0): Conv2d(1, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (4): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (5): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU(inplace=True)
    (7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (classifier): Sequential(
    (0): Flatten(start_dim=1, end_dim=-1)
    (1): Linear(in_features=2048, out_features=64, bias=True)
    (2): ReLU(inplace=True)
    (3): Dropout(p=0.3, inplace=False)
    (4): Linear(in_features=64, out_features=2, bias=True)
  )
)
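The `in_features=2048` of the first linear layer can be checked by hand from the layer settings printed above. The sketch below walks a 32×32 patch through the two conv/pool blocks using the standard output-size formula and also tallies the trainable parameters; `conv_out` is a hypothetical helper written for this check.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Spatial output size of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32
size = conv_out(size, kernel=3, padding=1)   # Conv2d(1, 16)  -> 32
size = conv_out(size, kernel=2, stride=2)    # MaxPool2d(2)   -> 16
size = conv_out(size, kernel=3, padding=1)   # Conv2d(16, 32) -> 16
size = conv_out(size, kernel=2, stride=2)    # MaxPool2d(2)   -> 8
flat_size = 32 * size * size                 # 32 channels x 8 x 8

params = (
    (1 * 16 * 3 * 3 + 16) + 2 * 16      # conv1 + BatchNorm2d(16) weight/bias
    + (16 * 32 * 3 * 3 + 32) + 2 * 32   # conv2 + BatchNorm2d(32) weight/bias
    + (flat_size * 64 + 64)             # Linear(2048, 64)
    + (64 * 2 + 2)                      # Linear(64, 2)
)
print(flat_size, params)  # 2048 136162
```

Almost all of the parameters sit in the first linear layer, which is typical for a small CNN that flattens its feature maps.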


### Pre-trained RoBERTa Baseline Model

In [16]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig

# Prepare text representations of the patches for RoBERTa
def patch_to_text(img, threshold=0.5):
    # Flatten the patch and map each normalized value to a word token ("bright"/"dark")
    return " ".join("bright" if px > threshold else "dark" for px in img.flatten())

X_text_train = [patch_to_text(img) for img in X_train]
X_text_val   = [patch_to_text(img) for img in X_val]

# Load pre-trained RoBERTa model and tokenizer for sequence classification
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
config = AutoConfig.from_pretrained("roberta-base", num_labels=num_classes)
roberta_model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=num_classes)
roberta_model = roberta_model.to(device)
optimizer_roberta = torch.optim.Adam(roberta_model.parameters(), lr=1e-5)

print(roberta_model)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
         

### Training Loop for CNN

In [18]:
from sklearn.model_selection import KFold, ParameterGrid
import torch.nn as nn
import torch.optim as optim

# 1) Define hyperparameter grid
param_grid = {
    "lr": [1e-3, 5e-4],
    "batch_size": [16, 32]
}

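# 2) Prepare data tensors (X_all, y_all) from the extracted patches
X_all = torch.tensor(X, dtype=torch.float32)  # X from preprocessing: shape (N,1,H,W)
y_all = torch.tensor(y, dtype=torch.long)

# 3) Set up 5‑fold CV
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# Prefer CUDA, then Apple MPS, then CPU (the recorded run used "mps")
device = (torch.device("cuda") if torch.cuda.is_available()
          else torch.device("mps") if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available()
          else torch.device("cpu"))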

best_score = 0.0
best_params = None

# 4) Grid search with CV
for params in ParameterGrid(param_grid):
    fold_scores = []
    for train_idx, val_idx in kf.split(X_all):
        # Create loaders for this fold
        train_ds = TensorDataset(X_all[train_idx], y_all[train_idx])
        val_ds   = TensorDataset(X_all[val_idx],   y_all[val_idx])
        train_loader = DataLoader(train_ds, batch_size=params["batch_size"], shuffle=True)
        val_loader   = DataLoader(val_ds,   batch_size=params["batch_size"])

        # Instantiate a fresh model each fold
        model = SeismicCNN().to(device)
        optimizer = optim.Adam(model.parameters(), lr=params["lr"])
        criterion = nn.CrossEntropyLoss()

        # Train for a small number of epochs (e.g. 5) per fold
        for epoch in range(5):
            model.train()
            for Xb, yb in train_loader:
                Xb, yb = Xb.to(device), yb.to(device)
                optimizer.zero_grad()
                out = model(Xb)
                loss = criterion(out, yb)
                loss.backward()
                optimizer.step()

        # Evaluate on validation fold
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for Xb, yb in val_loader:
                Xb, yb = Xb.to(device), yb.to(device)
                logits = model(Xb)
                preds  = logits.argmax(dim=1)
                correct += (preds == yb).sum().item()
                total   += yb.size(0)
        fold_scores.append(correct / total)

    # Compute average CV score
    avg_score = sum(fold_scores) / len(fold_scores)
    print(f"Params {params} → CV accuracy: {avg_score:.3f}")
    if avg_score > best_score:
        best_score  = avg_score
        best_params = params

print(f"\nBest hyperparameters: {best_params} with CV accuracy {best_score:.3f}")

Params {'batch_size': 16, 'lr': 0.001} → CV accuracy: 0.989
Params {'batch_size': 16, 'lr': 0.0005} → CV accuracy: 0.989
Params {'batch_size': 32, 'lr': 0.001} → CV accuracy: 0.989
Params {'batch_size': 32, 'lr': 0.0005} → CV accuracy: 0.989

Best hyperparameters: {'batch_size': 16, 'lr': 0.0005} with CV accuracy 0.989
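All four configurations reach the same rounded accuracy of 0.989. One way to put such a number in context, which this notebook does not do, is to compare it with the accuracy of always predicting the most frequent class. The sketch below uses toy labels only; `majority_baseline_accuracy` is a hypothetical helper, and passing the notebook's `y` instead would give the real baseline.

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class in y."""
    counts = np.bincount(y)
    return counts.max() / counts.sum()

# Toy labels only; replace with the notebook's `y` to get the real figure
y_toy = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1])
print(majority_baseline_accuracy(y_toy))  # 0.8
```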


### Training Loop for RoBERTa

In [None]:
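import os
# Set before the DataLoader workers are forked; intended to silence the
# huggingface/tokenizers fork warning shown in the output below
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import torch
from torch.optim import AdamW  # used in place of the deprecated transformers.AdamW
from torch.utils.data import DataLoader, TensorDataset
from torch.cuda.amp import autocast, GradScaler
from tqdm.auto import tqdm
from transformers import (
    AutoConfig, AutoTokenizer, AutoModelForSequenceClassification,
    get_linear_schedule_with_warmup
)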

# 1) Load and configure model/tokenizer
config    = AutoConfig.from_pretrained("roberta-base", num_labels=num_classes)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model     = AutoModelForSequenceClassification.from_pretrained("roberta-base", config=config)

# 2) Pre‑tokenize once
MAX_LEN = 256
train_enc = tokenizer(X_text_train, padding="max_length", truncation=True,
                      max_length=MAX_LEN, return_tensors="pt")
val_enc   = tokenizer(X_text_val,   padding="max_length", truncation=True,
                      max_length=MAX_LEN, return_tensors="pt")

train_ds = TensorDataset(train_enc.input_ids, train_enc.attention_mask, torch.tensor(y_train))
val_ds   = TensorDataset(val_enc.input_ids,   val_enc.attention_mask,   torch.tensor(y_val))

# 3) DataLoaders with minor speed tweaks
train_loader = DataLoader(train_ds, batch_size=16, shuffle=True,
                          pin_memory=True, num_workers=2)
val_loader   = DataLoader(val_ds,   batch_size=16,
                          pin_memory=True, num_workers=2)

# 4) Device, optimizer, scheduler, scaler
device = (torch.device("cuda") if torch.cuda.is_available()
          else torch.device("mps") if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available()
          else torch.device("cpu"))
print("Using device:", device)
model.to(device)

optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

total_steps   = len(train_loader) * 3  # 3 epochs
warmup_steps  = int(0.1 * total_steps)
scheduler     = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)

scaler = GradScaler(enabled=(device.type == "cuda"))  # mixed precision only on CUDA

# 5) Training loop with mixed precision, clipping, scheduler
EPOCHS = 3
for epoch in range(1, EPOCHS+1):
    # — Training —
    model.train()
    train_loss, train_correct = 0.0, 0
    for input_ids, attn_mask, labels in tqdm(train_loader, desc=f"Epoch {epoch}/{EPOCHS} Train"):
        input_ids = input_ids.to(device)
        attn_mask = attn_mask.to(device)
        labels    = labels.to(device)

        optimizer.zero_grad()
        with autocast(enabled=(device.type=="cuda")):
            outputs = model(input_ids=input_ids, attention_mask=attn_mask, labels=labels)
            loss    = outputs.loss

        scaler.scale(loss).backward()
        # gradient clipping
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()

        train_loss    += loss.item() * labels.size(0)
        train_correct += (outputs.logits.argmax(dim=1) == labels).sum().item()

    avg_train_loss = train_loss / len(train_ds)
    train_acc      = train_correct / len(train_ds)

    # — Validation —
    model.eval()
    val_correct = 0
    with torch.no_grad():
        for input_ids, attn_mask, labels in tqdm(val_loader, desc=f"Epoch {epoch}/{EPOCHS} Val"):
            input_ids = input_ids.to(device)
            attn_mask = attn_mask.to(device)
            labels    = labels.to(device)
            logits    = model(input_ids=input_ids, attention_mask=attn_mask).logits
            val_correct += (logits.argmax(dim=1) == labels).sum().item()

    val_acc = val_correct / len(val_ds)

    print(
        f"\nEpoch {epoch}/{EPOCHS} ▶ "
        f"Train loss: {avg_train_loss:.4f}, Train acc: {train_acc:.3f}; "
        f"Val acc: {val_acc:.3f}\n", flush=True
    )


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Using device: mps


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Epoch 1/3 Train:   0%|          | 0/13216 [00:01<?, ?it/s]

Epoch 1/3 Val:   0%|          | 0/3304 [00:00<?, ?it/s]


Epoch 1/3 ▶ Train loss: 0.0679, Train acc: 0.985; Val acc: 0.991



Epoch 2/3 Train:   0%|          | 0/13216 [00:00<?, ?it/s]
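The tokenizer call in the cell above truncates every sequence to `MAX_LEN = 256`, while each 32×32 patch is encoded as 1,024 words. A rough upper bound on how much of a patch RoBERTa can see follows from simple arithmetic. The sketch assumes that each word costs at least one token and that two special tokens (start and end) are added per sequence; `words_kept` is a hypothetical helper.

```python
def words_kept(patch_size, max_len, special_tokens=2):
    """Upper bound on how many pixel words survive truncation to max_len tokens.

    Assumes each word costs at least one token and that the tokenizer adds
    `special_tokens` markers (start and end) per sequence.
    """
    n_words = patch_size * patch_size
    kept = min(n_words, max_len - special_tokens)
    return n_words, kept, kept / patch_size   # words, words kept, patch rows covered

n_words, kept, rows = words_kept(patch_size=32, max_len=256)
print(n_words, kept, rows)  # 1024 254 7.9375
```

Under these assumptions the model sees at most about the first 8 of the 32 rows of each flattened patch, which is worth keeping in mind when comparing it with the CNN.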


### Evaluation and Metrics

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

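# 1) Define distinct validation loaders
# Rebuild the CNN validation set from the image patches: `val_ds` was overwritten
# with tokenized inputs in the RoBERTa training cell.
cnn_val_loader = DataLoader(
    TensorDataset(torch.tensor(X_val), torch.tensor(y_val)),
    batch_size=16, shuffle=False
)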

# Assuming you pre‑tokenized X_text_val into val_enc earlier:
roberta_val_loader = DataLoader(
    TensorDataset(
        val_enc.input_ids,
        val_enc.attention_mask,
        torch.tensor(y_val, dtype=torch.long)
    ),
    batch_size=16,
    shuffle=False
)

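# 2) Evaluate CNN
# NOTE: `cnn_model` is never trained in this notebook (the grid search trains a
# fresh `model` per fold), so train it before trusting these metrics.
cnn_model.to(device)  # `device` may have changed since the model was created
cnn_model.eval()
y_true = y_val  # numpy array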
y_pred_cnn, y_score_cnn = [], []
with torch.no_grad():
    for Xb, yb in cnn_val_loader:
        Xb = Xb.to(device)
        logits = cnn_model(Xb)
        probs  = torch.softmax(logits, dim=1)
        preds  = probs.argmax(dim=1)
        y_pred_cnn.extend(preds.cpu().numpy())
        y_score_cnn.extend(probs[:, 1].cpu().numpy())  # class‑1 probability

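# 3) Evaluate RoBERTa
# Use `model`, the network fine-tuned in the training cell; the `roberta_model`
# created in the earlier cell was never trained.
roberta_model = model.to(device)
roberta_model.eval()
y_pred_rob, y_score_rob = [], []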
with torch.no_grad():
    for input_ids, attn_mask, labels in roberta_val_loader:
        input_ids = input_ids.to(device)
        attn_mask = attn_mask.to(device)
        logits    = roberta_model(input_ids=input_ids, attention_mask=attn_mask).logits
        probs     = torch.softmax(logits, dim=1)
        preds     = probs.argmax(dim=1)
        y_pred_rob.extend(preds.cpu().numpy())
        y_score_rob.extend(probs[:, 1].cpu().numpy())

# 4) Compute metrics
acc_cnn    = accuracy_score(y_true,      y_pred_cnn)
prec_cnn   = precision_score(y_true,     y_pred_cnn, average='binary')
recall_cnn = recall_score(y_true,        y_pred_cnn, average='binary')
f1_cnn     = f1_score(y_true,            y_pred_cnn, average='binary')
auc_cnn    = roc_auc_score(y_true,       y_score_cnn)

acc_rob    = accuracy_score(y_true,      y_pred_rob)
prec_rob   = precision_score(y_true,     y_pred_rob, average='binary')
recall_rob = recall_score(y_true,        y_pred_rob, average='binary')
f1_rob     = f1_score(y_true,            y_pred_rob, average='binary')
auc_rob    = roc_auc_score(y_true,       y_score_rob)

print(f"CNN     ▶ Acc: {acc_cnn:.3f}, Precision: {prec_cnn:.3f}, Recall: {recall_cnn:.3f}, F1: {f1_cnn:.3f}, AUC: {auc_cnn:.3f}")
print(f"RoBERTa ▶ Acc: {acc_rob:.3f}, Precision: {prec_rob:.3f}, Recall: {recall_rob:.3f}, F1: {f1_rob:.3f}, AUC: {auc_rob:.3f}")
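For reference, the binary precision, recall, and F1 values requested from scikit-learn above reduce to simple ratios of confusion-matrix counts. The sketch below computes them by hand on toy labels; `binary_metrics` is a hypothetical helper and the toy arrays are not results from this notebook.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall    = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return accuracy, precision, recall, f1

y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0, 0, 0]
print(binary_metrics(y_true_toy, y_pred_toy))  # (0.75, 0.75, 0.75, 0.75)
```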

### Plotting CNN and RoBERTa Training

In [None]:
import matplotlib.pyplot as plt

# --- Safety checks ---
missing = [name for name in ("cnn_train_loss", "cnn_val_loss", "cnn_train_acc", "cnn_val_acc") if name not in globals()]
if missing:
    raise NameError(f"Missing CNN history variables: {missing}. Run the CNN training loop first.")

# roberta history is optional
has_rob = all(name in globals() for name in ("roberta_train_loss", "roberta_val_loss", "roberta_train_acc", "roberta_val_acc"))

# --- Epoch counts ---
num_epochs_cnn = len(cnn_train_loss)
num_epochs_rob = len(roberta_train_loss) if has_rob else 0

# --- Set up subplots ---
# If RoB history exists: 2 rows (CNN, RoB), 2 cols (loss, acc).
# Otherwise: only CNN row.
nrows = 2 if has_rob else 1
fig, axes = plt.subplots(nrows, 2, figsize=(12, 4*nrows))

# Ensure axes is 2D array
if nrows == 1:
    axes = axes[np.newaxis, :]

# --- CNN: Loss ---
axes[0, 0].plot(range(1, num_epochs_cnn+1), cnn_train_loss, label="Train Loss")
axes[0, 0].plot(range(1, num_epochs_cnn+1), cnn_val_loss,   label="Val Loss")
axes[0, 0].set_title("CNN Loss")
axes[0, 0].set_xlabel("Epoch")
axes[0, 0].set_ylabel("Loss")
axes[0, 0].legend()

# --- CNN: Accuracy ---
axes[0, 1].plot(range(1, num_epochs_cnn+1), [a*100 for a in cnn_train_acc], label="Train Acc")
axes[0, 1].plot(range(1, num_epochs_cnn+1), [a*100 for a in cnn_val_acc],   label="Val Acc")
axes[0, 1].set_title("CNN Accuracy (%)")
axes[0, 1].set_xlabel("Epoch")
axes[0, 1].set_ylabel("Accuracy (%)")
axes[0, 1].legend()

if has_rob:
    # --- RoBERTa: Loss ---
    axes[1, 0].plot(range(1, num_epochs_rob+1), roberta_train_loss, label="Train Loss")
    axes[1, 0].plot(range(1, num_epochs_rob+1), roberta_val_loss,   label="Val Loss")
    axes[1, 0].set_title("RoBERTa Loss")
    axes[1, 0].set_xlabel("Epoch")
    axes[1, 0].set_ylabel("Loss")
    axes[1, 0].legend()

    # --- RoBERTa: Accuracy ---
    axes[1, 1].plot(range(1, num_epochs_rob+1), [a*100 for a in roberta_train_acc], label="Train Acc")
    axes[1, 1].plot(range(1, num_epochs_rob+1), [a*100 for a in roberta_val_acc],   label="Val Acc")
    axes[1, 1].set_title("RoBERTa Accuracy (%)")
    axes[1, 1].set_xlabel("Epoch")
    axes[1, 1].set_ylabel("Accuracy (%)")
    axes[1, 1].legend()

plt.tight_layout()
plt.show()
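The plotting cell expects per-epoch history lists such as `cnn_train_loss`, but the training cells in this notebook never fill them (the list definitions are commented out in the first cell). One possible way to record them is sketched below. `history` and `record_epoch` are hypothetical names, and the numbers are placeholders for illustration, not results from this notebook.

```python
from collections import defaultdict

history = defaultdict(list)   # hypothetical container, e.g. history["cnn_train_loss"]

def record_epoch(history, prefix, train_loss, val_loss, train_acc, val_acc):
    """Append one epoch of metrics under the names the plotting cell looks for."""
    history[f"{prefix}_train_loss"].append(train_loss)
    history[f"{prefix}_val_loss"].append(val_loss)
    history[f"{prefix}_train_acc"].append(train_acc)
    history[f"{prefix}_val_acc"].append(val_acc)

# Placeholder numbers for illustration only, not results from this notebook
record_epoch(history, "cnn", 0.70, 0.65, 0.60, 0.62)
record_epoch(history, "cnn", 0.50, 0.55, 0.78, 0.75)

# Expose the lists as module-level names so the `name in globals()` check passes
globals().update(history)
print(sorted(history), len(history["cnn_train_loss"]))
```

In a real training loop, `record_epoch` would be called once at the end of each epoch with the measured losses and accuracies, using the prefix `"cnn"` or `"roberta"`.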
