In [None]:
"""
Optimizing AI Reasoning: A Hamiltonian Dynamics Approach to Multi-Hop Question Answering

Author: Javier Marín
Email: javier@jmarin.info
Version: 1.0
Last Updated: March 8, 2024

Copyright (c) 2024 Javier Marín

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

Description:
This notebook implements a novel approach to multi-hop reasoning in AI systems
using principles from Hamiltonian mechanics. It includes a custom optimizer
(AdvancedSymplecticOptimizer) and a modified GPT-2 model (HamiltonianGPT2)
for improved performance on multi-hop question answering tasks.

Requirements:
- Python 3.7+
- PyTorch 1.7+
- Transformers 4.0+
- NumPy
- Pandas
- Scikit-learn
- TQDM

For full requirements, see requirements.txt

Usage:
This notebook is designed to be run in a Jupyter environment or Google Colab.
Ensure all required libraries are installed and the dataset (obqa_chains.csv)
is in the same directory as this notebook.

For more information, please refer to the accompanying paper:
[https://arxiv.org/abs/2410.04415]

"""

##$\textbf{Optimizing AI Reasoning: A Hamiltonian Dynamics Approach to Multi-Hop Question Answering}$

This notebook implements the concepts and methods described in our paper "Optimizing AI Reasoning: A Hamiltonian Dynamics Approach to Multi-Hop Question Answering" (https://arxiv.org/abs/2410.04415).


## 1. Introduction

In this work, we propose a novel approach to analyzing and improving multi-hop reasoning in AI systems by drawing inspiration from Hamiltonian mechanics. Our method maps reasoning chains in embedding spaces to Hamiltonian systems, allowing us to leverage powerful analytical tools from classical physics.

## 1.1 Theoretical Background


### 1.1.1 Hamiltonian Systems
The Hamiltonian formalism provides a mathematical framework for the theory of conservative mechanical systems and a geometric language used across many fields of physics. A Hamiltonian system is defined by the following $2n$ ordinary differential equations (Easton, 1993):

<center>$\dot{q}\ =H_p$</center>

<center>$\dot{p}=-H_q$</center>

<center>$\dot{q}_i = \frac{\partial H}{\partial p_i}(t,q,p), \qquad \dot{p}_i = -\frac{\partial H}{\partial q_i}(t,q,p), \qquad i = 1, \dots, n$</center>

where $H\ =\ H(t,q,p)$ is the Hamiltonian, $q$ and $p$ are the position and momentum vectors of a mechanical system with $n$ degrees of freedom, and $t$ is the time. For the purpose of this experiment, we can define the Hamiltonian, $H$, of a system as (Marin, 2024):

<center>$H(q, p) = T(p) + V(q)$</center>

where $T(p)$ is the kinetic energy and $V(q)$ is the potential energy of the system. Each point in phase space represents a unique state of the system, defined by its position and momentum coordinates $(q,p)$.

In our AI context:
- $q$ represents the current state of reasoning
- $p$ represents the change in reasoning
- $T(p)$ represents the "cost" of changing the reasoning state
- $V(q)$ represents the relevance or correctness of the current reasoning state
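
As a minimal sketch of this mapping (not the notebook's pipeline), the snippet below evaluates $H = T(p) + V(q)$ along a short chain of reasoning states. The names `chain_hamiltonian` and `toy_potential` are hypothetical, the momentum is assumed to be the difference between consecutive states, and the quadratic potential is only a placeholder for a task-specific $V(q)$.

```python
def kinetic_energy(p):
    """T(p) = 1/2 * |p|^2 for a change vector p."""
    return 0.5 * sum(x * x for x in p)


def chain_hamiltonian(states, potential):
    """Return H = T(p) + V(q) for each step of a chain of states.

    Assumption: the momentum at step t is the difference q_{t+1} - q_t.
    """
    energies = []
    for q, q_next in zip(states, states[1:]):
        p = [b - a for a, b in zip(q, q_next)]
        energies.append(kinetic_energy(p) + potential(q))
    return energies


def toy_potential(q):
    """Placeholder potential (assumption): V(q) = 1/2 * |q|^2."""
    return 0.5 * sum(x * x for x in q)


# Three 2-D "reasoning states" standing in for embeddings (made-up values).
chain = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
energies = chain_hamiltonian(chain, toy_potential)
print(energies)  # [0.5, 2.5]
```

A chain whose energy stays roughly constant from step to step would correspond to a trajectory that conserves $H$; here the second hop is both larger and further from the origin, so its energy is higher.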

$\textbf{References}$
- Marin, J. (2024). Optimizing AI Reasoning: A Hamiltonian Dynamics Approach to Multi-Hop Question Answering. ArXiv Preprint ArXiv:2410.04415.
- Easton, R. W. (1993). Introduction to Hamiltonian dynamical systems and the N-body problem (KR Meyer and GR Hall). SIAM Review, 35(4), 659.

### 1.1.2 Symplectic Integration

Symplectic structures are fundamental geometric objects in differential geometry and classical mechanics (Goldman, 1984). The phase space of a Hamiltonian system is a symplectic manifold, and Hamiltonian flows preserve the symplectic structure (Prugovečki, 1979). These structures provide a framework for understanding the relationship between position and momentum in physical systems, allowing for the formulation of Hamilton's equations of motion. Put simply, a symplectic structure is the rule that ties positions and momenta together as a system evolves, which lets us predict how the system will behave over time.
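
To make "preserving the symplectic structure" concrete, here is a small sketch for a one-dimensional system, where the condition reduces to the one-step map having a Jacobian determinant of 1 (phase-space area is preserved). It assumes a harmonic oscillator $H = \frac{1}{2}p^2 + \frac{1}{2}q^2$, chosen only for illustration, and compares explicit Euler with symplectic Euler; neither scheme is the one used later in the notebook.

```python
def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def explicit_euler_jacobian(h):
    # One step: q' = q + h*p, p' = p - h*q
    return [[1.0, h], [-h, 1.0]]


def symplectic_euler_jacobian(h):
    # One step: p' = p - h*q, then q' = q + h*p' = (1 - h^2)*q + h*p
    return [[1.0 - h * h, h], [-h, 1.0]]


h = 0.1
det_explicit = det2(explicit_euler_jacobian(h))      # 1 + h^2: area grows each step
det_symplectic = det2(symplectic_euler_jacobian(h))  # 1: area is preserved
print(det_explicit, det_symplectic)
```

The explicit scheme inflates phase-space area by a factor $1 + h^2$ per step, which is why its energy drifts, while the symplectic variant keeps the determinant at 1 for any step size.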

The kinetic term $T(p)$ can be interpreted as the cognitive effort or computational cost associated with changing the reasoning state (Friston, 2010):
<center>$T(p)=\frac{1}{2}p^2$</center>

where $p$ is the magnitude of the change vector. This quadratic form is analogous to kinetic energy in classical physics and penalizes large, rapid changes in reasoning. The term $V(q)$ represents the degree to which the present reasoning state corresponds with the objective or question being addressed (Marin, 2024) and is modeled by $\frac{1}{2}|q|$, which enters the Hamiltonian with a negative sign. The exact Hamiltonian is then:

<center>$H_0(p,q)=\frac{1}{2}p^2 - \frac{1}{2}|q|$</center>
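
As a worked step, applying Hamilton's equations from Section 1.1.1 to this $H_0$ gives, in the scalar case and for $q \neq 0$ (where $|q|$ is differentiable):

<center>$\dot{q} = \frac{\partial H_0}{\partial p} = p, \qquad \dot{p} = -\frac{\partial H_0}{\partial q} = \frac{1}{2}\,\mathrm{sign}(q)$</center>

This is only a sketch under the scalar assumption; for vector-valued $q$ with $|q|$ read as a norm, the second equation would instead involve $q/|q|$.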

We use the Forest-Ruth algorithm (Omelyan et al., 2002), a 4th-order symplectic integrator, in our optimizer. A symmetric $n$th-order symplectic algorithm advances the system in time with the modified Hamiltonian (Chin & Kidwell, 2000):
<center>$H(p, q) = H_0(p, q) + \epsilon^nH_n(p, q) + O(\epsilon^{n+2})$</center>

which adds the term $\epsilon^n H_n(p, q)$ to the exact Hamiltonian $H_0$ given above. This is the leading error introduced by the numerical method, while $O(\epsilon^{n+2})$ collects the higher-order error terms.
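
For reference, a textbook Forest-Ruth step can be sketched as below. This is an illustration under stated assumptions rather than the optimizer's code: it uses unit mass and a harmonic oscillator $H = \frac{1}{2}p^2 + \frac{1}{2}q^2$ (a smooth stand-in for $H_0$), and the names `forest_ruth_step` and `max_energy_drift` are hypothetical.

```python
THETA = 1.0 / (2.0 - 2.0 ** (1.0 / 3.0))  # Forest-Ruth coefficient, about 1.3512


def forest_ruth_step(q, p, h, force):
    """Advance (q, p) by one step of size h with the 4th-order Forest-Ruth scheme.

    Assumes unit mass, so the velocity equals the momentum p.
    """
    q += THETA * h / 2 * p
    p += THETA * h * force(q)
    q += (1 - THETA) * h / 2 * p
    p += (1 - 2 * THETA) * h * force(q)
    q += (1 - THETA) * h / 2 * p
    p += THETA * h * force(q)
    q += THETA * h / 2 * p
    return q, p


def energy(q, p):
    """Harmonic-oscillator energy (assumed test system)."""
    return 0.5 * p * p + 0.5 * q * q


def max_energy_drift(h=0.1, steps=1000):
    """Largest deviation of the energy from its initial value over a run."""
    q, p = 1.0, 0.0
    e0 = energy(q, p)
    drift = 0.0
    for _ in range(steps):
        q, p = forest_ruth_step(q, p, h, force=lambda x: -x)
        drift = max(drift, abs(energy(q, p) - e0))
    return drift


print(max_energy_drift(0.1, 1000), max_energy_drift(0.05, 2000))
```

Halving the step size over the same total time should shrink the energy error by roughly a factor of 16, which is the behavior expected from the $\epsilon^n H_n$ term with $n = 4$.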



$\textbf{References}$
 - Goldman, W. M. (1984). The symplectic nature of fundamental groups of surfaces. Advances in Mathematics, 54(2), 200–225.
 - Prugovečki, E. (1979). Stochastic phase spaces and master Liouville spaces in statistical mechanics. Foundations of Physics, 9(7–8), 575–587.
 - Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2), 127–138.
 - Marin, J. (2024). Optimizing AI Reasoning: A Hamiltonian Dynamics Approach to Multi-Hop Question Answering. ArXiv Preprint ArXiv:2410.04415.
 - Omelyan, I. P., Mryglod, I. M., & Folk, R. (2002). Optimized Forest–Ruth- and Suzuki-like algorithms for integration of motion in many-body systems. Computer Physics Communications, 146(2), 188–202. doi:10.1016/s0010-4655(02)00451-4
 - Chin, S. A., & Kidwell, D. W. (2000). Higher-order force gradient symplectic algorithms. Physical Review E, 62(6), 8746.

## 2. Implementation

In [None]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, SubsetRandomSampler
from torch.optim import Optimizer
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import GPT2LMHeadModel, GPT2Tokenizer, GPT2Config
from tqdm import tqdm


### 2.1 Custom functions

In [None]:
# Custom Dataset
class OBQADataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length):
        self.inputs = tokenizer(
            texts,
            padding=True,
            truncation=True,
            max_length=max_length,
            return_tensors="pt"
            )
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.inputs.items()}, self.labels[idx]

# Custom Optimizer
class AdvancedSymplecticOptimizer(Optimizer):
    def __init__(self, params, lr=1e-3, beta=0.9, epsilon=1e-8):
        defaults = dict(lr=lr, beta=beta, epsilon=epsilon)
        super(AdvancedSymplecticOptimizer, self).__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad
                state = self.state[p]

                if len(state) == 0:
                    state['step'] = 0
                    state['momentum'] = torch.zeros_like(p.data)

                momentum = state['momentum']
                lr, beta, eps = group['lr'], group['beta'], group['epsilon']

                state['step'] += 1

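                # Momentum update: exponential moving average of the gradient.
                # Inspired by the Forest-Ruth symplectic scheme; this simplified
                # step does not perform the full multi-stage 4th-order update.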
                momentum.mul_(beta).add_(grad, alpha=1 - beta)

                # Compute adaptive step size
                kinetic = 0.5 * (momentum ** 2).sum()
                potential = 0.5 * (grad ** 2).sum()
                hamiltonian = kinetic + potential
                step_size = lr / (hamiltonian.sqrt() + eps)

                p.add_(momentum, alpha=-step_size)

        return loss

"""
Modified GPT-2 model.
We use GPT-2 as the base model and add a classification layer on top for
our task, a common approach in transfer learning. The classification layer
does not exist in the original GPT-2 checkpoint, so it cannot be initialized
with pre-trained weights. The whole model, including the new classification
layer, is fine-tuned on our task.
"""
class HamiltonianGPT2(GPT2LMHeadModel):
    def __init__(self, config):
        super().__init__(config)
        self.classifier = nn.Linear(config.n_embd, 2)  # Binary classification

    def forward(self, input_ids, attention_mask=None, labels=None):
        outputs = super().forward(
            input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True
            )
        hidden_states = outputs.hidden_states[-1]  # Get the last hidden state
        pooled_output = hidden_states.mean(dim=1)
        logits = self.classifier(pooled_output)

        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits, labels)

        return (logits, loss)


# Hamiltonian-inspired loss function
def hamiltonian_loss(outputs, labels, model):
    logits, base_loss = outputs
    if base_loss is None:
        loss_fct = nn.CrossEntropyLoss()
        base_loss = loss_fct(logits, labels)
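    # Add regularization based on Hamiltonian principles.
    # Note: .item() converts each norm to a Python float, so reg_term is a
    # constant offset on the reported loss and contributes no gradient.
    param_norm = sum(p.norm().item() for p in model.parameters())
    reg_term = 0.01 * param_norm  # Adjust coefficient as needed
    return base_loss + reg_term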


# Evaluation function
def evaluate_model(model, dataloader, device):
    model.eval()
    all_preds = []
    all_labels = []

    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Evaluating"):
            inputs = {k: v.to(device) for k, v in batch[0].items()}
            labels = batch[1].to(device)

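            outputs = model(**inputs)
            if isinstance(model, HamiltonianGPT2):
                logits = outputs[0]  # (batch, 2) class logits
            else:
                # Standard GPT-2: average the LM logits over sequence length.
                # The result has shape (batch, vocab_size), so the argmax below
                # is a token id that is compared directly with the 0/1 labels.
                logits = outputs.logits.mean(dim=1)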

            preds = torch.argmax(logits, dim=-1)
            all_preds.extend(preds.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())

    accuracy = accuracy_score(all_labels, all_preds)
    precision, recall, f1, _ = precision_recall_fscore_support(
        all_labels,
        all_preds,
        average='weighted',
        zero_division=0
        )

    return accuracy, precision, recall, f1
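
The per-parameter update in `AdvancedSymplecticOptimizer.step` can be mirrored in plain Python to see what the energy-normalized step size does. The sketch below is an illustration on a toy loss $\sum_i w_i^2$ with made-up values; `symplectic_style_step` is a hypothetical helper, not part of the notebook. Because the step size divides by $\sqrt{T + V}$, the length of each update is bounded by $\sqrt{2}\,\text{lr}$ regardless of the gradient scale.

```python
import math


def symplectic_style_step(params, grads, momentum, lr=1e-3, beta=0.9, eps=1e-8):
    """One update following the same arithmetic as the optimizer's step()."""
    # Exponential moving average of the gradient.
    momentum = [beta * m + (1 - beta) * g for m, g in zip(momentum, grads)]
    # "Kinetic" and "potential" energies, as defined in the optimizer.
    kinetic = 0.5 * sum(m * m for m in momentum)
    potential = 0.5 * sum(g * g for g in grads)
    step_size = lr / (math.sqrt(kinetic + potential) + eps)
    params = [w - step_size * m for w, m in zip(params, momentum)]
    return params, momentum


lr = 1e-3
params, momentum = [1.0, -2.0], [0.0, 0.0]
step_norms = []
for _ in range(3):
    grads = [2 * w for w in params]  # gradient of the toy loss sum(w^2)
    new_params, momentum = symplectic_style_step(params, grads, momentum, lr=lr)
    step_norms.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(new_params, params))))
    params = new_params

print(step_norms)
```

In the notebook this normalization is applied separately to each parameter tensor, so the bound holds per tensor rather than for the model as a whole.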

## 2.2 Model initialization

In [None]:
# Modify the model initialization
config = GPT2Config.from_pretrained('gpt2', output_hidden_states=True)
model = HamiltonianGPT2.from_pretrained('gpt2', config=config)
print("Note: The classifier layer has been newly initialized and will be trained on our specific task.")

Some weights of HamiltonianGPT2 were not initialized from the model checkpoint at gpt2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Note: The classifier layer has been newly initialized and will be trained on our specific task.


## 2.3 Load Data

We used the OpenBookQA (OBQA) dataset for our research, which provides a benchmark for assessing the question answering and reasoning abilities of AI systems. The OBQA dataset was presented by Mihaylov et al. (2018) in their research on open-book question answering. Our experiment on explanation generation concentrates on the OBQA test set, which has 500 questions.

### References
- Mihaylov, T., Clark, P., Khot, T., & Sabharwal, A. (2018). Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).

In [None]:
# Load and prepare data
df = pd.read_csv("obqa_chains.csv", sep=";")
texts = df['Fact1'] + ' ' + df['Fact2']
labels = df['Turk'].apply(lambda x: 1 if 'yes' in str(x).lower() else 0)

# Split data into train+val and test sets
train_val_texts, test_texts, train_val_labels, test_labels = train_test_split(
    texts,
    labels,
    test_size=0.2,
    random_state=42
    )
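
The label rule used above maps any `Turk` value containing "yes" (case-insensitive) to 1 and everything else, including missing values, to 0. The sketch below illustrates this on made-up values and counts the resulting classes; `turk_to_label` and `sample_turk` are hypothetical names, and the real column contents may differ.

```python
from collections import Counter


def turk_to_label(value):
    """Same rule as the notebook's lambda: 1 if 'yes' appears, else 0."""
    return 1 if "yes" in str(value).lower() else 0


# Made-up annotation values, including missing entries.
sample_turk = ["Yes", "no", "YES, valid chain", None, float("nan")]
sample_labels = [turk_to_label(v) for v in sample_turk]
class_counts = Counter(sample_labels)

print(sample_labels)  # [1, 0, 1, 0, 0]
print(class_counts)
```

Checking the class counts before training is useful here, since the weighted metrics reported later depend strongly on how imbalanced the two classes are.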

## 2.4 Initialize model

In [None]:
# Initialize tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token

# Prepare datasets and dataloaders
train_val_dataset = OBQADataset(
    train_val_texts.tolist(),
    train_val_labels.tolist(),
    tokenizer, max_length=128
    )
test_dataset = OBQADataset(
    test_texts.tolist(),
    test_labels.tolist(),
    tokenizer, max_length=128
    )

# Set up device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Training setup
config = GPT2Config.from_pretrained('gpt2', output_hidden_states=True)
model = HamiltonianGPT2.from_pretrained('gpt2', config=config).to(device)
print("Note: Some weights are newly initialized as expected for the classifier layer.")

optimizer = AdvancedSymplecticOptimizer(model.parameters(), lr=5e-5)
num_epochs = 5

# K-fold Cross-validation
k_folds = 3
kf = KFold(n_splits=k_folds, shuffle=True, random_state=42)

# Lists to store results
cv_accuracies = []
cv_f1_scores = []

for fold, (train_idx, val_idx) in enumerate(kf.split(train_val_dataset)):
    print(f"\nFold {fold+1}/{k_folds}")

    # Create data loaders for train and validation sets
    train_sampler = SubsetRandomSampler(train_idx)
    val_sampler = SubsetRandomSampler(val_idx)

    train_loader = DataLoader(
        train_val_dataset,
        batch_size=16,
        sampler=train_sampler
        )
    val_loader = DataLoader(
        train_val_dataset,
        batch_size=16,
        sampler=val_sampler
        )

    # Initialize model and optimizer
    config = GPT2Config.from_pretrained('gpt2', output_hidden_states=True)
    model = HamiltonianGPT2.from_pretrained('gpt2', config=config).to(device)
    optimizer = AdvancedSymplecticOptimizer(model.parameters(), lr=5e-5)

    # Training loop
    num_epochs = 5
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0

        for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}"):
            optimizer.zero_grad()
            inputs = {k: v.to(device) for k, v in batch[0].items()}
            labels = batch[1].to(device)

            outputs = model(**inputs, labels=labels)
            loss = hamiltonian_loss(outputs, labels, model)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        avg_train_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1}/{num_epochs}, Average Train Loss: {avg_train_loss:.4f}")

        # Validate after each epoch
        val_accuracy, val_precision, val_recall, val_f1 = evaluate_model(
            model,
            val_loader,
            device
            )
        print(f"Validation - Accuracy: {val_accuracy:.4f}, F1: {val_f1:.4f}")

    # Final validation
    val_accuracy, val_precision, val_recall, val_f1 = evaluate_model(
        model,
        val_loader,
        device
        )
    cv_accuracies.append(val_accuracy)
    cv_f1_scores.append(val_f1)
    print(f"Final Validation - Accuracy: {val_accuracy:.4f}, F1: {val_f1:.4f}")

# Print cross-validation results
print("\nCross-validation results:")
print(f"Mean Accuracy: {np.mean(cv_accuracies):.4f} (+/- {np.std(cv_accuracies):.4f})")
print(f"Mean F1 Score: {np.mean(cv_f1_scores):.4f} (+/- {np.std(cv_f1_scores):.4f})")

Note: Some weights are newly initialized as expected for the classifier layer.

Fold 1/3
Epoch 1/5, Average Train Loss: 97.1495
Validation - Accuracy: 0.0865, F1: 0.0138
Epoch 2/5, Average Train Loss: 96.6095
Validation - Accuracy: 0.1316, F1: 0.1003
Epoch 3/5, Average Train Loss: 96.1578
Validation - Accuracy: 0.4737, F1: 0.5732
Epoch 4/5, Average Train Loss: 95.8661
Validation - Accuracy: 0.8684, F1: 0.8611
Epoch 5/5, Average Train Loss: 95.7034
Validation - Accuracy: 0.9098, F1: 0.8704
Final Validation - Accuracy: 0.9098, F1: 0.8704

Fold 2/3
Epoch 1/5, Average Train Loss: 96.9004
Validation - Accuracy: 0.9323, F1: 0.8997
Epoch 2/5, Average Train Loss: 96.7498
Validation - Accuracy: 0.9323, F1: 0.8997
Epoch 3/5, Average Train Loss: 96.5963
Validation - Accuracy: 0.9323, F1: 0.8997
Epoch 4/5, Average Train Loss: 96.4597
Validation - Accuracy: 0.9323, F1: 0.8997
Epoch 5/5, Average Train Loss: 96.3228
Validation - Accuracy: 0.9323, F1: 0.8997
Final Validation - Accuracy: 0.9323, F1: 0.8997

Fold 3/3
Epoch 1/5, Average Train Loss: 96.8905
Validation - Accuracy: 0.1241, F1: 0.0274
Epoch 2/5, Average Train Loss: 96.4422
Validation - Accuracy: 0.1165, F1: 0.0259
Epoch 3/5, Average Train Loss: 96.0641
Validation - Accuracy: 0.4436, F1: 0.5358
Epoch 4/5, Average Train Loss: 95.7965
Validation - Accuracy: 0.8571, F1: 0.8145
Epoch 5/5, Average Train Loss: 95.6512
Validation - Accuracy: 0.8759, F1: 0.8180
Final Validation - Accuracy: 0.8759, F1: 0.8180

Cross-validation results:
Mean Accuracy: 0.9060 (+/- 0.0232)
Mean F1 Score: 0.8627 (+/- 0.0338)





In [None]:
# Train on full training set
full_train_loader = DataLoader(train_val_dataset, batch_size=16, shuffle=True)
model = HamiltonianGPT2.from_pretrained('gpt2', config=config).to(device)
optimizer = AdvancedSymplecticOptimizer(model.parameters(), lr=5e-5)

for epoch in range(num_epochs):
    model.train()
    total_loss = 0

    for batch in tqdm(full_train_loader, desc=f"Full Training - Epoch {epoch+1}/{num_epochs}"):
        optimizer.zero_grad()
        inputs = {k: v.to(device) for k, v in batch[0].items()}
        labels = batch[1].to(device)

        outputs = model(**inputs, labels=labels)
        loss = hamiltonian_loss(outputs, labels, model)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_train_loss = total_loss / len(full_train_loader)
    print(f"Epoch {epoch+1}/{num_epochs}, Average Train Loss: {avg_train_loss:.4f}")



Epoch 1/5, Average Train Loss: 96.3875
Epoch 2/5, Average Train Loss: 95.9207
Epoch 3/5, Average Train Loss: 95.6713
Epoch 4/5, Average Train Loss: 95.6040
Epoch 5/5, Average Train Loss: 95.6056





## 3. Results and Analysis

### 3.1 Evaluate standard model

In [None]:
# Evaluate standard GPT-2
test_loader = DataLoader(test_dataset, batch_size=16)
print("\nEvaluating Standard GPT-2:")
standard_gpt2 = GPT2LMHeadModel.from_pretrained('gpt2').to(device)
standard_accuracy, standard_precision, standard_recall, standard_f1 = evaluate_model(standard_gpt2, test_loader, device)

print("Standard GPT-2 Results:")
print(f"Accuracy: {standard_accuracy:.4f}, Precision: {standard_precision:.4f}, Recall: {standard_recall:.4f}, F1: {standard_f1:.4f}")



Evaluating Standard GPT-2:
Standard GPT-2 Results:
Accuracy: 0.0200, Precision: 0.0105, Recall: 0.0200, F1: 0.0138





### 3.2 Evaluate fine-tuned model on the test set

In [None]:
# Final test
test_loader = DataLoader(test_dataset, batch_size=16)
test_accuracy, test_precision, test_recall, test_f1 = evaluate_model(model, test_loader, device)
print("\nFinal Test Results:")
print(f"Accuracy: {test_accuracy:.4f}, Precision: {test_precision:.4f}, Recall: {test_recall:.4f}, F1: {test_f1:.4f}")

Final Test Results:
Accuracy: 0.8950, Precision: 0.8010, Recall: 0.8950, F1: 0.8454
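
To help read the weighted metrics above, the sketch below recomputes weighted precision, recall and F1 in plain Python for one hypothetical scenario: 200 test examples (consistent with the 13 batches of 16 in the log), 179 of them in class 0, and a classifier that predicts class 0 for every example. The class split and the constant predictor are assumptions made for illustration, not something the notebook reports; the point is only to show how `average='weighted'` with `zero_division=0` combines per-class scores.

```python
def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall and F1 (zero_division=0 behavior)."""
    n = len(y_true)
    precision = recall = f1 = 0.0
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)
        support = sum(1 for t in y_true if t == c)
        prec_c = tp / predicted if predicted else 0.0  # no predictions -> 0
        rec_c = tp / support
        f1_c = 2 * prec_c * rec_c / (prec_c + rec_c) if prec_c + rec_c else 0.0
        weight = support / n
        precision += weight * prec_c
        recall += weight * rec_c
        f1 += weight * f1_c
    return precision, recall, f1


# Hypothetical test set: 179 examples of class 0, 21 of class 1.
y_true = [0] * 179 + [1] * 21
y_pred = [0] * 200  # constant predictor (assumption)

precision, recall, f1 = weighted_prf(y_true, y_pred)
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")
```

Under these assumptions the weighted recall equals the accuracy and the weighted precision equals the accuracy squared, so comparing the reported figures against this pattern, or inspecting a confusion matrix, is a quick way to check whether a model has collapsed onto the majority class.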





### 3.3 Cross-validation results

In [None]:
# Print cross-validation results
print("\nCross-validation results:")
print(f"Mean Accuracy: {np.mean(cv_accuracies):.4f} (+/- {np.std(cv_accuracies):.4f})")
print(f"Mean F1 Score: {np.mean(cv_f1_scores):.4f} (+/- {np.std(cv_f1_scores):.4f})")


Cross-validation results:
Mean Accuracy: 0.9060 (+/- 0.0232)
Mean F1 Score: 0.8627 (+/- 0.0338)
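
The summary above can be reproduced from the three final per-fold values in the training log. The sketch below uses the standard library instead of NumPy; `np.std` defaults to the population standard deviation (`ddof=0`), which corresponds to `statistics.pstdev`. The inputs are the rounded values printed in the log, so agreement is only expected to the printed four decimals.

```python
from statistics import mean, pstdev

# Final validation scores per fold, copied from the log above.
fold_accuracies = [0.9098, 0.9323, 0.8759]
fold_f1_scores = [0.8704, 0.8997, 0.8180]

acc_mean, acc_std = mean(fold_accuracies), pstdev(fold_accuracies)
f1_mean, f1_std = mean(fold_f1_scores), pstdev(fold_f1_scores)

print(f"Mean Accuracy: {acc_mean:.4f} (+/- {acc_std:.4f})")
print(f"Mean F1 Score: {f1_mean:.4f} (+/- {f1_std:.4f})")
```

With only three folds, the sample standard deviation (`ddof=1`) would be noticeably larger than the population value shown here, so it is worth stating which one is reported.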
