## Step 1: Set Up the CommonsenseQA Dataset

- Download and prepare the CommonsenseQA dataset
- Split the data into train/validation/test sets if not already done
- Understand the format (questions, multiple-choice answers)
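
To get a feel for the format, the sketch below turns one record into a plain-text question with lettered choices plus the index of the gold answer. The field names follow the Hugging Face `tau/commonsense_qa` schema; the record itself and the helper `format_example` are made up for illustration.

```python
# Illustrative record that mirrors the Hugging Face schema of tau/commonsense_qa.
# The question and choices are invented for demonstration purposes.
example = {
    "id": "demo-0001",
    "question": "Where would you store milk to keep it cold?",
    "question_concept": "milk",
    "choices": {
        "label": ["A", "B", "C", "D", "E"],
        "text": ["oven", "refrigerator", "bookshelf", "garage", "mailbox"],
    },
    "answerKey": "B",
}


def format_example(record):
    """Return (text, label_index) for one multiple-choice record."""
    labels = record["choices"]["label"]
    texts = record["choices"]["text"]
    lines = [f"Question: {record['question']}"]
    lines += [f"{label}. {text}" for label, text in zip(labels, texts)]
    # Records without a usable answer key get label_index -1.
    answer = record.get("answerKey", "")
    label_index = labels.index(answer) if answer in labels else -1
    return "\n".join(lines), label_index


text, label_index = format_example(example)
print(text)
print("gold index:", label_index)
```

The same function can be mapped over a loaded split to produce model inputs and integer labels.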

## Step 2: Set Up Three Models

1. **Randomly Initialized Transformer**
   - Build a transformer architecture from scratch
   - Initialize weights randomly
   - This will serve as your baseline

2. **Pretrained Transformer**
   - Use the same transformer architecture as Model 1
   - Initialize with pretrained weights (e.g., BERT, RoBERTa)
   - Make sure this model wasn't specifically trained on CommonsenseQA

3. **Large Language Model (1B+ parameters)**
   - Choose an LLM variant with at least 1B parameters (e.g., GPT-2 XL, LLaMA, OPT, BLOOM)
   - No finetuning for this model, only prompt engineering

## Step 3: Training/Finetuning

- Finetune Models 1 & 2 on the CommonsenseQA train set
  - Use the same hyperparameters for both
  - Train for multiple epochs
  - Save checkpoints and track validation performance
- For Model 3 (LLM), develop effective prompts instead of finetuning
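
One way to handle "save checkpoints and track validation performance" is a small helper that remembers the best validation accuracy and signals when to stop. The class name `BestCheckpointTracker`, the `patience` parameter, and the accuracy values are all hypothetical; this is a sketch, not a fixed part of the plan.

```python
class BestCheckpointTracker:
    """Remember the best validation accuracy and stop after `patience` bad epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_accuracy = float("-inf")
        self.best_epoch = None
        self.epochs_without_improvement = 0

    def update(self, epoch, val_accuracy):
        """Return True if this epoch is the new best and should be checkpointed."""
        if val_accuracy > self.best_accuracy:
            self.best_accuracy = val_accuracy
            self.best_epoch = epoch
            self.epochs_without_improvement = 0
            return True
        self.epochs_without_improvement += 1
        return False

    @property
    def should_stop(self):
        return self.epochs_without_improvement >= self.patience


# Made-up validation accuracies, only to exercise the logic.
tracker = BestCheckpointTracker(patience=2)
history = [0.21, 0.25, 0.24, 0.23, 0.30]
for epoch, acc in enumerate(history):
    if tracker.update(epoch, acc):
        # In a real run, torch.save(model.state_dict(), path) would go here.
        print(f"epoch {epoch}: new best {acc:.2f}")
    if tracker.should_stop:
        print(f"epoch {epoch}: stopping early")
        break
```

Using the same tracker settings for Models 1 and 2 keeps the comparison between them fair.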

## Step 4: Prompt Engineering (for LLM)

- Design different prompt formats
- Test various instruction styles
- Try few-shot examples in prompts
- Experiment with temperature and other generation parameters
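
A possible starting point for the prompt experiments is sketched below: one function renders a question, one assembles an instruction plus few-shot examples, and one extracts the answer letter from generated text. The function names, the instruction wording, and the demo questions are assumptions for illustration.

```python
import re

LABELS = ["A", "B", "C", "D", "E"]


def render_question(question, choices, answer=None):
    """Render one question; leave the answer slot open if `answer` is None."""
    lines = [f"Question: {question}"]
    lines += [f"{label}. {choice}" for label, choice in zip(LABELS, choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)


def build_prompt(shots, question, choices,
                 instruction="Answer with the letter of the best choice."):
    """Concatenate an instruction, solved few-shot examples, and the open query."""
    blocks = [instruction]
    blocks += [render_question(q, c, a) for q, c, a in shots]
    blocks.append(render_question(question, choices))
    return "\n\n".join(blocks)


def parse_answer(generated_text):
    """Return the first standalone letter A-E in the model output, or None."""
    match = re.search(r"\b([A-E])\b", generated_text)
    return match.group(1) if match else None


# Invented demo data.
shots = [("What do people use to cut paper?",
          ["scissors", "pillow", "spoon", "blanket", "sponge"], "A")]
prompt = build_prompt(shots, "Where do fish live?",
                      ["desert", "water", "cloud", "attic", "oven"])
print(prompt)
print(parse_answer(" B. water"))
```

The regex parser is deliberately simple: an output that starts with the article "A" would be read as choice A, so limiting generation to a few tokens or comparing the likelihood of each letter may be more robust.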

## Step 5: Evaluation

- Evaluate all three models on the test set
- Calculate accuracy, F1 score, or other relevant metrics
- Compare performance across models
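
The metrics can be computed without extra dependencies. The sketch below implements accuracy and macro-averaged F1 over the five answer letters; the gold and predicted labels are invented. With five choices per question, random guessing yields about 20% accuracy, which is a useful reference point.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[g] += 1  # gold class gets a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Invented labels, only to exercise the functions.
gold = ["A", "B", "C", "D", "E", "A"]
pred = ["A", "B", "C", "D", "A", "B"]
print(f"accuracy: {accuracy(gold, pred):.3f}")
print(f"macro F1: {macro_f1(gold, pred, ['A', 'B', 'C', 'D', 'E']):.3f}")
```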

## Step 6: Analysis

- Analyze which types of questions each model handles well/poorly
- Look at error patterns
- Discuss why certain approaches work better
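
A simple way to look for error patterns is to group predictions by some property of the question and compare accuracy per group. The sketch below groups by the first word of the question; the helper names and the records are hypothetical, and other keys (for example `question_concept` or question length) can be swapped in.

```python
from collections import defaultdict


def accuracy_by_group(records, predictions, key):
    """Group records with `key(record)` and return {group: (correct, total, accuracy)}."""
    counts = defaultdict(lambda: [0, 0])
    for record, pred in zip(records, predictions):
        group = key(record)
        counts[group][0] += int(pred == record["answerKey"])
        counts[group][1] += 1
    return {g: (c, t, c / t) for g, (c, t) in counts.items()}


def first_word(record):
    """Use the question word (where, what, why, ...) as the grouping key."""
    return record["question"].split()[0].lower()


# Invented records and predictions.
records = [
    {"question": "Where would you find a stapler?", "answerKey": "A"},
    {"question": "Where do birds sleep?", "answerKey": "C"},
    {"question": "What is used to cut bread?", "answerKey": "B"},
    {"question": "Why do people wear coats?", "answerKey": "E"},
]
predictions = ["A", "B", "B", "E"]

report = accuracy_by_group(records, predictions, first_word)
for group, (correct, total, acc) in sorted(report.items()):
    print(f"{group:>6}: {correct}/{total} correct ({acc:.0%})")
```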

## Step 7: Create Presentation

- Summarize methodology
- Present results with visualizations
- Include discussion of findings
- Provide limitations and potential improvements

Technical Requirements:

- Programming language: Python recommended
- Libraries: PyTorch/TensorFlow, Transformers (Hugging Face), etc.
- Computational resources: You'll need GPU access for training

**Delete steps generated by Claude later**

# Introduction

Import all required libraries.

In [2]:
import os
import time
import copy
from datetime import datetime
from collections import Counter

import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as data
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import gensim

from datasets import load_dataset
from huggingface_hub import hf_hub_download

from tqdm import tqdm, trange
import wandb

Set the random seed to ensure reproducibility.

_Info about the seed value: The field of natural language processing began in the 1940s, after World War II. At this time, people recognized the importance of translation from one language to another and hoped to create a machine that could do this sort of translation automatically._

In [3]:
SEED = 1940 # normal: 42

np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

Next, I load and split the dataset. The last 1,000 entries of the official train split are held out as the validation set, and the remaining entries are used for training. The official validation split serves as the test set, because the official test split has no answer keys.

In [4]:
train = load_dataset("tau/commonsense_qa", split="train[:-1000]")
valid = load_dataset("tau/commonsense_qa", split="train[-1000:]")
test = load_dataset("tau/commonsense_qa", split="validation")

print(len(train), len(valid), len(test))

8741 1000 1221


Log in to Weights & Biases for experiment tracking.

In [5]:
wandb.login()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mfabian-dubach[0m ([33mfabian-dubach-hochschule-luzern[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

# Setup

# Preprocessing

# Model

### Transformer architecture

In [5]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = torch.softmax(attn_scores, dim=-1)
        output = torch.matmul(attn_probs, V)
        return output
        
    def split_heads(self, x):
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)
        
    def combine_heads(self, x):
        batch_size, _, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, self.d_model)
        
    def forward(self, Q, K, V, mask=None):
        Q = self.split_heads(self.W_q(Q))
        K = self.split_heads(self.W_k(K))
        V = self.split_heads(self.W_v(V))
        
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        output = self.W_o(self.combine_heads(attn_output))
        return output
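
For reference, `scaled_dot_product_attention` computes the standard attention formula, written here with $h$ for `num_heads` and $d_k$ for the per-head dimension:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad d_k = \frac{d_{\text{model}}}{h}
```

Positions where `mask == 0` have their score set to $-10^{9}$ before the softmax, so their attention weight is effectively zero.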

In [6]:
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionWiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

In [7]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length):
        super(PositionalEncoding, self).__init__()
        
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        
        self.register_buffer('pe', pe.unsqueeze(0))
        
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]
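
The buffer `pe` holds the sinusoidal encoding, where $pos$ is the token position and $i$ indexes pairs of embedding dimensions:

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```

The code's `div_term` is the same denominator in exponential form, since $\exp\!\left(-2i \cdot \ln(10000) / d_{\text{model}}\right) = 10000^{-2i/d_{\text{model}}}$ and `torch.arange(0, d_model, 2)` yields the values $2i$. The slicing into even and odd columns assumes an even `d_model`.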

In [8]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

In [9]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.cross_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = PositionWiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, enc_output, src_mask, tgt_mask):
        attn_output = self.self_attn(x, x, x, tgt_mask)
        x = self.norm1(x + self.dropout(attn_output))
        attn_output = self.cross_attn(x, enc_output, enc_output, src_mask)
        x = self.norm2(x + self.dropout(attn_output))
        ff_output = self.feed_forward(x)
        x = self.norm3(x + self.dropout(ff_output))
        return x

In [10]:
class Transformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout):
        super(Transformer, self).__init__()
        self.encoder_embedding = nn.Embedding(src_vocab_size, d_model)
        self.decoder_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_seq_length)

        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

        self.fc = nn.Linear(d_model, tgt_vocab_size)
        self.dropout = nn.Dropout(dropout)

    def generate_mask(self, src, tgt):
        src_mask = (src != 0).unsqueeze(1).unsqueeze(2)
        tgt_mask = (tgt != 0).unsqueeze(1).unsqueeze(3)
        seq_length = tgt.size(1)
        # Create the causal mask on the same device as tgt so it also works on GPU
        nopeak_mask = (1 - torch.triu(torch.ones(1, seq_length, seq_length, device=tgt.device), diagonal=1)).bool()
        tgt_mask = tgt_mask & nopeak_mask
        return src_mask, tgt_mask

    def forward(self, src, tgt):
        src_mask, tgt_mask = self.generate_mask(src, tgt)
        src_embedded = self.dropout(self.positional_encoding(self.encoder_embedding(src)))
        tgt_embedded = self.dropout(self.positional_encoding(self.decoder_embedding(tgt)))

        enc_output = src_embedded
        for enc_layer in self.encoder_layers:
            enc_output = enc_layer(enc_output, src_mask)

        dec_output = tgt_embedded
        for dec_layer in self.decoder_layers:
            dec_output = dec_layer(dec_output, enc_output, src_mask, tgt_mask)

        output = self.fc(dec_output)
        return output
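
To make the mask shapes in `generate_mask` concrete, here is a standalone sketch on toy tensors. It assumes, like the code above, that token id 0 is padding; the token ids are arbitrary. `torch.tril` is used as an equivalent of `1 - torch.triu(..., diagonal=1)`.

```python
import torch

PAD_ID = 0  # padding token id, matching the `!= 0` checks in generate_mask

src = torch.tensor([[5, 7, 9, 0]])  # one padded source sequence, shape (1, 4)
tgt = torch.tensor([[2, 4, 0]])     # one padded target sequence, shape (1, 3)

# Padding mask over encoder keys: (batch, 1, 1, src_len)
src_mask = (src != PAD_ID).unsqueeze(1).unsqueeze(2)

# Padding mask over decoder queries: (batch, 1, tgt_len, 1)
tgt_pad_mask = (tgt != PAD_ID).unsqueeze(1).unsqueeze(3)

# Causal mask: position i may attend to positions <= i, shape (1, tgt_len, tgt_len)
tgt_len = tgt.size(1)
causal = torch.tril(torch.ones(1, tgt_len, tgt_len, device=tgt.device)).bool()

# Broadcasting combines both into (batch, 1, tgt_len, tgt_len)
tgt_mask = tgt_pad_mask & causal

print(src_mask.shape, tgt_mask.shape)
print(tgt_mask[0, 0].int())
```

The last row of the printed mask is all zeros because the third target token is padding. Such a row receives a uniform softmax after `masked_fill`, which should be harmless as long as padded positions are excluded from the loss.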

### Randomly initialized transformer

In [11]:
import torch
import torch.nn as nn
import math

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length=5000):
        super(PositionalEncoding, self).__init__()
        
        # Create positional encoding matrix
        pe = torch.zeros(max_seq_length, d_model)
        position = torch.arange(0, max_seq_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        
        # Register as buffer (not a parameter but should be saved)
        self.register_buffer('pe', pe)
        
    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        
        # Linear projections
        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)
        
    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)
        
        # Linear projections and reshape
        q = self.q_linear(q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.k_linear(k).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.v_linear(v).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        
        # Scaled dot-product attention
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)
        
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
            
        attention = torch.softmax(scores, dim=-1)
        output = torch.matmul(attention, v)
        
        # Reshape and concatenate heads
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.out(output)

class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super(FeedForward, self).__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(d_ff, d_model)
        
    def forward(self, x):
        return self.linear2(self.dropout(torch.relu(self.linear1(x))))

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForward(d_model, d_ff, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        # Self attention with residual connection and layer norm
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        
        # Feed forward with residual connection and layer norm
        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))
        return x

class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout=0.1):
        super(TransformerEncoder, self).__init__()
        self.d_model = d_model
        
        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = PositionalEncoding(d_model, max_seq_length)
        self.dropout = nn.Dropout(dropout)
        
        # Encoder layers
        self.layers = nn.ModuleList([
            EncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        
        self.norm = nn.LayerNorm(d_model)
        
        # Override PyTorch's default initialization with Xavier/Glorot uniform
        self._init_weights()
        
    def _init_weights(self):
        # Apply Xavier/Glorot uniform initialization to every weight matrix
        # (parameters with more than one dimension, including the embedding table).
        # Biases and LayerNorm parameters keep their PyTorch defaults.
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)
                
    def forward(self, src, mask=None):
        # src: [batch_size, seq_length]
        
        # Embedding and positional encoding
        src = self.embedding(src) * math.sqrt(self.d_model)
        src = self.pos_encoder(src)
        x = self.dropout(src)
        
        # Pass through each encoder layer
        for layer in self.layers:
            x = layer(x, mask)
            
        return self.norm(x)

# Example task-specific head for classification
class TransformerForSequenceClassification(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, d_ff, max_seq_length, num_classes, dropout=0.1):
        super(TransformerForSequenceClassification, self).__init__()
        
        self.encoder = TransformerEncoder(
            vocab_size=vocab_size,
            d_model=d_model,
            num_heads=num_heads,
            num_layers=num_layers,
            d_ff=d_ff,
            max_seq_length=max_seq_length,
            dropout=dropout
        )
        
        # Classification head
        self.classifier = nn.Linear(d_model, num_classes)
        
    def forward(self, x, mask=None):
        # Get encoder output
        encoded = self.encoder(x, mask)
        
        # Mean-pool over the sequence dimension. Note that this average also
        # includes padded positions; using the first token is an alternative.
        pooled = encoded.mean(dim=1)
        
        # Apply classifier
        return self.classifier(pooled)
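
Because `encoded.mean(dim=1)` averages over padded positions as well, a mask-aware mean may be preferable for batches with very different sequence lengths. The helper `masked_mean` below is a hypothetical sketch, not part of the model above; it assumes a boolean mask that is `True` for real tokens.

```python
import torch


def masked_mean(encoded, padding_mask):
    """Average token vectors over real tokens only.

    encoded:      (batch, seq_len, d_model) encoder output
    padding_mask: (batch, seq_len) bool, True for real tokens, False for padding
    """
    weights = padding_mask.unsqueeze(-1).to(encoded.dtype)  # (batch, seq_len, 1)
    summed = (encoded * weights).sum(dim=1)                 # (batch, d_model)
    counts = weights.sum(dim=1).clamp(min=1.0)              # avoid division by zero
    return summed / counts


# Toy encoder output: the third position stands in for a padding vector.
encoded = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]])
padding_mask = torch.tensor([[True, True, False]])

print(masked_mean(encoded, padding_mask))  # ignores the padded position
print(encoded.mean(dim=1))                 # plain mean is pulled toward the padding vector
```

The same boolean mask, reshaped with `.unsqueeze(1).unsqueeze(2)`, has the shape that `MultiHeadAttention` broadcasts against its attention scores.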

# Training

Example training setup:

In [None]:
# Define hyperparameters
vocab_size = 30000     # Depends on your tokenizer
d_model = 768          # Embedding dimension
num_heads = 12         # Number of attention heads
num_layers = 6         # Number of transformer layers
d_ff = 3072            # Feed-forward dimension
max_seq_length = 512   # Maximum sequence length
num_classes = 5        # CommonsenseQA has five answer choices (A-E)

In [None]:
# Initialize the model with random weights
model = TransformerForSequenceClassification(
    vocab_size=vocab_size,
    d_model=d_model,
    num_heads=num_heads,
    num_layers=num_layers,
    d_ff=d_ff,
    max_seq_length=max_seq_length,
    num_classes=num_classes
)
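
As a sanity check on model size, the parameter count of `TransformerForSequenceClassification` can be estimated by formula. The function name below is hypothetical, and `num_classes=5` in the example call is an assumption (five answer choices). With the example hyperparameters, the estimate comes to roughly 65.6 million parameters, far below the 1B+ range of the LLM.

```python
def encoder_classifier_param_count(vocab_size, d_model, num_layers, d_ff, num_classes):
    """Count trainable parameters of the encoder-plus-classifier model by formula."""

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix + bias

    layer_norm = 2 * d_model                      # scale + shift
    attention = 4 * linear(d_model, d_model)      # q, k, v and output projections
    feed_forward = linear(d_model, d_ff) + linear(d_ff, d_model)
    per_layer = attention + feed_forward + 2 * layer_norm
    embedding = vocab_size * d_model              # positional encoding is a buffer, not a parameter
    final_norm = layer_norm
    classifier = linear(d_model, num_classes)
    return embedding + num_layers * per_layer + final_norm + classifier


total = encoder_classifier_param_count(
    vocab_size=30000, d_model=768, num_layers=6, d_ff=3072,
    num_classes=5,  # assumption: one class per answer choice
)
print(f"{total:,} parameters")
```

The number of heads does not appear in the formula because splitting `d_model` across heads does not change the size of the projection matrices. The estimate can be cross-checked with `sum(p.numel() for p in model.parameters())`.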

In [None]:
# Now train the model from scratch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

In [None]:
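# Training loop
# Assumes train_dataloader yields (inputs, labels) batches of token ids and class indices.
num_epochs = 3  # assumed value, tune on the validation set

for epoch in range(num_epochs):
    model.train()  # enable dropout during training
    for inputs, labels in train_dataloader:
        # Forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
    # Evaluation, logging, etc.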
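
The loop above needs a `train_dataloader`. A minimal sketch of one is shown below, assuming each example has already been tokenized into a list of ids and that id 0 is padding. The toy data, `PAD_ID`, and the `collate` function are illustrative and not taken from the notebook.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_ID = 0  # assumed padding id


def collate(batch):
    """Pad variable-length id sequences, build the attention mask, stack labels."""
    ids = [torch.tensor(token_ids, dtype=torch.long) for token_ids, _ in batch]
    labels = torch.tensor([label for _, label in batch], dtype=torch.long)
    inputs = pad_sequence(ids, batch_first=True, padding_value=PAD_ID)
    mask = (inputs != PAD_ID).unsqueeze(1).unsqueeze(2)  # (batch, 1, 1, seq_len)
    return inputs, mask, labels


# Toy (token_ids, label) pairs standing in for tokenized CommonsenseQA examples.
toy_data = [([4, 8, 15], 1), ([16, 23], 0), ([42, 7, 9, 3], 4), ([5], 2)]
train_dataloader = DataLoader(toy_data, batch_size=2, shuffle=False, collate_fn=collate)

for inputs, mask, labels in train_dataloader:
    print(tuple(inputs.shape), tuple(mask.shape), labels.tolist())
```

This collate function yields three tensors per batch, so the training loop would unpack `inputs, mask, labels` and call `model(inputs, mask)`. For real training, `shuffle=True` is the usual choice.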

# Evaluation

Important: Evaluate on the test split (the official validation split), not on the held-out validation set, and never on the training set.

# Interpretation

# Tools used

## **Adjust this section before submitting**

1. **Programming Environment**
   - Python 3.12.8
   - Jupyter Notebook

2. **Machine Learning and Deep Learning**
   - PyTorch (neural network development)
   - Hugging Face Datasets (data management)
   - NLTK (natural language preprocessing)
   - FastText (pre-trained word embeddings, 300-dimensional vectors)

3. **Data Manipulation and Analysis**
   - NumPy (numerical computing)
   - Pandas (data structuring and manipulation)
   - Scikit-learn (potential additional machine learning utilities)

4. **Visualization and Tracking**
   - Matplotlib (basic plotting)
   - Seaborn (statistical data visualization)
   - Weights & Biases (experiment tracking and logging)
     * Tracked metrics: training loss, accuracy, learning rates
     * Logged hyperparameter configurations
     * Enabled comparative analysis across model runs

5. **Computational Infrastructure**
   - CUDA-enabled GPU acceleration
   - GPU-optimized PyTorch operations
   - Efficient parallel computing for model training

6. **Dataset and Benchmarking**
   - CommonsenseQA dataset (Hugging Face)
   - Standard benchmark for commonsense reasoning tasks

7. **Additional Libraries**
   - Gensim (word vector processing)
   - tqdm (progress bar visualization)
   - datetime (experiment timestamping)

8. **AI-Tools**
   - Claude 3.5 Sonnet: Utilized as a coding assistant for debugging, optimization and documentation.
   - GPT-4-turbo: Assisted in drafting and refining documentation, helping with structure and phrasing.
   - Copilot: Used for quick inline completions when a suggestion matched what I was planning to write.

9. **Sources**
   - Transformer architecture: https://medium.com/data-science/build-your-own-transformer-from-scratch-using-pytorch-84c850470dcb