## Step 1: Set Up the CommonsenseQA Dataset

- Download and prepare the CommonsenseQA dataset
- Split the data into train/validation/test sets if not already done
- Understand the format (questions, multiple-choice answers)

## Step 2: Set Up Three Models

1. Randomly Initialized Transformer
   - Build a transformer architecture from scratch
   - Initialize weights randomly
   - This will serve as your baseline
2. Pretrained Transformer
   - Use the same transformer architecture as Model 1
   - Initialize with pretrained weights (e.g., BERT, RoBERTa)
   - Make sure this model wasn't specifically trained on CommonsenseQA
3. Large Language Model (1B+ parameters)
   - Choose an LLM (e.g., GPT-2 XL, LLaMA, OPT, BLOOM)
   - No finetuning for this model, just prompt engineering


## Step 3: Training/Finetuning

- Finetune Models 1 & 2 on the CommonsenseQA train set
  - Use the same hyperparameters for both
  - Train for multiple epochs
  - Save checkpoints and track validation performance
- For Model 3 (LLM), develop effective prompts instead of finetuning

## Step 4: Prompt Engineering (for LLM)

- Design different prompt formats
- Test various instruction styles
- Try few-shot examples in prompts
- Experiment with temperature and other generation parameters

## Step 5: Evaluation

- Evaluate all three models on the test set
- Calculate accuracy, F1 score, or other relevant metrics
- Compare performance across models

## Step 6: Analysis

- Analyze which types of questions each model handles well/poorly
- Look at error patterns
- Discuss why certain approaches work better

## Step 7: Create Presentation

- Summarize methodology
- Present results with visualizations
- Include discussion of findings
- Provide limitations and potential improvements

Technical Requirements:

- Programming language: Python recommended
- Libraries: PyTorch/TensorFlow, Transformers (Hugging Face), etc.
- Computational resources: You'll need GPU access for training

TODO: Checkpointing, Early stopping works?, Log to Wandb, sweeps or other auto tool (optional), llm

**Delete steps generated by Claude later**

# **FS25 NLP Project 1: Word Embeddings/Recurrent Neural Networks**

Fabian Dubach

# **Introduction**

<style>
  .container {
    display: flex;
    align-items: flex-start;
    gap: 20px; /* spacing between text and ASCII art */
    font-family: monospace;
  }
  .text {
    flex: 2;
  }
  .ascii {
    white-space: pre;
    font-size: 4.5px;
    line-height: 1.2;
    flex: 1;
  }
</style>

<div class="container">
  <div class="text">
    <p>The task for my project was to perform common sense question answering using the CommonsenseQA dataset.</p><br>
    <p>I evaluated the performance of three different Transformer-based models:</p>
    <p>1. A randomly initialized Transformer</p>
    <p>2. A pretrained Transformer (with the same architecture as the first Transformer)</p>
    <p>3. A large language model (LLM) with over 1 billion parameters</p><br>
    <p>While the first two models were finetuned on the dataset using the same hyperparameters for a fair comparison, the LLM was evaluated through prompt engineering without additional training. This setup allowed me to explore how different levels of pretraining and model scale impact common sense reasoning performance.</p>
    <p>All training runs were also tracked with wandb (workspace URL: <a href="https://wandb.ai/fabian-dubach-hochschule-luzern/CommonsenseQA/workspace?nw=nwuserfabiandubach" target="_blank">https://wandb.ai/fabian-dubach-hochschule-luzern/CommonsenseQA/workspace?nw=nwuserfabiandubach</a>).</p>
  </div>

  <div class="ascii">
<pre>
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣶⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢰⠀⠀⠀⠀⠀⣤⣤⣤⠀⠀⠀⠀⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⡇⠀⣠⡶⢿⡇⢿⣿⡏⢳⣦⠀⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣾⡛⣆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣧⣼⣿⣴⣋⡽⠮⠿⢭⣟⣏⣷⣿⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢻⣧⠘⡆⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⡼⣇⣿⡿⠶⣶⣿⣟⡛⣷⣿⢠⠙⣧⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⡈⣏⠇⢹⡀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⡟⢹⠁⣿⠋⠉⢹⠉⠙⣿⡇⣾⣀⣾⠀⢀⣤⡀⢀⡀⠀⠀⢀⣠⣴⣾⠛⢻⡛⢻⡄⢀⣳⡀⢀⣠⠄⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣾⣷⣾⢀⣿⡇⠀⠸⠀⠀⣿⣧⡽⠿⣟⣺⣭⠴⢿⡏⣩⣷⡾⢛⣭⣴⣿⣇⠘⣿⣷⣿⡛⠉⢻⣟⣷⠄⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⠿⢿⣟⣿⣿⡦⣶⣪⡭⠿⣚⣫⣭⣽⣶⡄⠀⢸⡇⣿⡙⣿⣿⣿⣿⣿⣿⣆⠹⣿⣿⣷⡀⠀⢿⡉⠁⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣀⣀⣤⣶⣿⠿⠛⣉⣭⣶⣾⣿⠿⠟⠛⠉⠉⢻⠀⢸⣷⣿⣇⢻⡿⣿⣿⣿⣿⠟⠀⠹⣿⣿⠃⠀⠘⣷⡀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣤⣦⣼⣿⠿⠛⣋⡁⣼⢠⣿⡿⠛⠉⠁⠀⠀⢀⡀⢀⣴⣾⠀⢸⣿⡇⢻⡄⠙⠿⠻⠛⠁⠀⢀⣠⣽⣿⣇⡀⠀⠸⣧⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⣾⠿⣛⣭⣴⡾⠟⠛⣧⣿⢸⡿⠀⠀⠀⠀⣰⣿⣿⣷⣾⣿⣿⠀⢸⡏⣇⢸⣷⡀⠀⢀⣠⣴⣾⠿⠛⣿⢻⣿⣹⡀⠀⢻⣆⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣴⡟⣦⠀⠀⠀⢀⡿⣵⡿⠛⠉⣡⣶⣤⣄⣿⣯⢸⣇⠀⠀⢠⣾⣿⡿⣿⣿⣿⣿⡿⠀⢸⡇⢻⡼⣿⣷⣶⠿⠛⠉⠀⠀⠀⠸⡇⣿⣿⣧⠀⠘⣿⡀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⡇⢹⠀⢀⣠⣼⣿⣿⠀⢀⣼⣿⣿⣿⣿⡇⣿⢸⣿⣀⣀⣿⡿⠿⠶⠚⠛⠉⠉⠀⠀⢸⡇⠀⢻⣾⣝⣿⡆⠀⢀⣠⡴⠖⠛⢻⡾⣿⣿⣆⠀⢹⡇⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⣇⣼⡾⠟⠋⣿⢻⣇⣤⣌⠻⢿⣿⣿⣿⠃⢿⠀⠉⠉⠁⠀⠀⠀⣀⣤⡤⠶⠶⠒⠚⣻⣷⣄⠈⣿⣿⣿⣿⡞⠉⠀⠀⠀⠀⠀⣿⢿⣿⣾⣋⣽⠇⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⣹⠏⠀⠀⠀⣿⢿⣿⣿⣯⡴⠾⠛⢋⣡⠶⠛⠛⠋⣉⣉⣉⣙⢻⣿⠀⠀⠀⠀⠀⢠⡟⠀⠈⠻⢦⣈⣿⣿⣧⠀⠀⢀⣠⣴⡾⢿⣿⣿⣿⣿⣿⡀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⡟⣿⡟⠀⠀⠀⣿⠈⠋⠉⢀⣠⠴⣛⣩⣤⣶⣞⣭⣿⢿⣿⣿⣻⣼⣿⣆⣀⣤⣤⣴⣿⣄⣠⣶⣦⣀⣙⣿⣿⣿⡶⣿⠟⠋⣁⣶⠟⢻⣽⣿⣿⣿⠇⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⢠⣿⣇⠀⠀⠀⢹⣠⡴⠖⢻⣷⢫⣿⣿⣿⣯⣿⣟⣿⣿⣭⣽⣿⡿⣿⣿⣿⠿⠿⢿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡿⣿⠋⠉⣿⠀⢸⣿⣿⣿⣿⣷⡀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣼⣿⣿⣤⣴⣾⢿⡅⠀⣀⣾⢿⣿⣿⣿⣿⣿⣿⡿⣿⣷⣿⣿⣿⡇⣿⣿⡇⠀⠀⢸⣿⣿⡟⢿⣿⣿⣿⣿⣿⣣⣿⠁⣿⣀⣤⡿⠀⢀⣿⣿⣿⣿⣿⡇⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⡇⠻⣿⠛⠉⠀⠈⣿⠛⢽⣿⢻⣿⣿⢿⣿⣿⣿⡇⣿⠿⣶⣶⣚⣧⣿⣿⡇⠀⠀⣸⣿⣿⣿⣄⣈⢿⣿⢿⣷⣿⣿⠀⠉⠉⠀⠀⠀⠘⡇⣿⣿⣿⣿⡇⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⡇⡀⣷⡆⠀⠀⠀⠸⣧⣻⣿⢸⣿⣿⡿⢿⣾⣻⡇⣿⣿⣿⣿⣿⣿⣿⠿⠷⠾⠛⠛⠿⢿⣿⣿⣿⣄⣿⠿⠋⢸⣿⠀⠀⠀⠀⠀⠀⠀⡇⣿⣿⣿⣿⣿⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣷⡇⣿⡇⠀⠀⠀⠀⣿⣿⣿⡾⢿⣿⣿⣿⣿⡶⠷⠾⠛⠛⠉⠁⢀⣠⠤⠴⠒⡆⢠⠀⢰⡉⠻⣿⣽⡏⠀⠀⢸⡇⠀⠀⠀⠀⠀⠀⠀⡇⣿⡿⣿⣿⣿⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⣧⣿⠿⢀⣀⣤⣴⣿⣿⣿⡷⠾⠛⠋⠉⢀⣀⣠⠤⠴⠒⠻⡆⢸⠀⠀⢀⡠⠇⠸⡄⠈⣇⠀⠈⡻⢦⡀⠀⢸⡇⠀⠀⠀⠀⠀⠀⠀⡇⣿⣧⡘⠿⢻⡆
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠻⣆⣿⣿⣿⣿⣿⡿⠛⣉⣀⡀⣠⠴⠒⠋⠉⠁⠀⠀⠀⠀⠀⡇⢸⣠⠴⣫⡄⠀⠀⡇⠀⢹⠀⠀⣿⠦⢿⡀⢸⡇⠀⠀⣀⣤⣤⣿⠀⡇⣿⣿⣿⣆⢸⡇
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣿⢿⡟⣽⣿⠀⣏⠁⠀⡇⡟⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣇⠀⡖⣻⠋⠀⠀⠈⢻⠀⢈⡇⠀⠸⡄⠘⣧⢸⡇⠀⢸⣷⣾⣿⠏⠀⡇⣿⣿⣿⣿⢸⡇
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣾⠏⠛⠋⢡⣿⠀⠸⣿⣟⡃⣇⠀⠀⠀⠀⠀⣀⣠⡤⠶⠒⠋⠀⠛⠁⠀⣀⣤⣶⣿⣿⣿⣿⣷⣤⡈⠁⢻⡞⣿⠀⠈⠻⣴⠏⠀⠀⠿⢹⣿⣎⢻⣿⡇
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣾⡟⠀⠀⢀⡿⣿⠀⠀⠈⠳⡇⠻⠤⠶⠚⠋⠉⠁⠀⠀⠀⠀⠀⣀⣤⣶⣿⣿⣿⣿⣿⠿⠛⠻⣿⣿⣿⣷⣜⣷⣿⠀⠀⢀⣀⣤⣤⣶⣾⣶⣿⣿⠃⢸⡇
⠀⠀⠀⠀⠀⠀⣀⣤⡶⠶⠖⠚⢛⠛⠳⢶⣼⡟⠀⠀⢀⣼⣹⣿⢀⠀⠀⠀⠀⡀⠀⠀⠀⠀⠀⢀⣀⣠⡤⢤⣾⣿⣿⣿⡿⠿⠛⠉⠹⡇⠀⠀⣿⣿⣟⢿⣿⣿⠹⣶⣿⡿⠛⠻⣏⠀⠉⠉⡛⣿⡿⣾⡇
⠀⠀⠀⢀⣴⠞⠋⢰⡇⢰⣿⢻⢻⢻⢶⣦⠙⣷⡀⠀⣸⢧⠟⢿⣿⣿⣿⣷⣶⣶⣤⣴⣲⡾⠿⠟⠒⠒⠛⡇⠙⣿⠉⠀⢧⠀⠀⠀⠀⣧⠀⠀⢸⣿⣿⡎⣿⠁⢀⣼⣏⢀⣠⣤⣸⣶⠀⠀⣿⣿⣿⠛⠁
⠀⠀⠀⣾⠃⠀⣠⡬⣤⣼⣛⠾⣼⣞⡾⡟⠀⠘⣧⣠⣏⡞⠀⠈⠻⣿⡏⢹⡟⠛⠻⣿⠁⠀⠀⠀⠀⠀⠀⣇⠀⣿⠀⠀⢸⡄⠀⠀⠀⢸⠀⠀⠘⣿⣿⣇⣿⣴⡞⢣⣽⣿⣿⣿⣿⣿⠀⠀⣿⣿⡟⠀⠀
⠀⠀⠀⣿⡶⣿⣿⣸⣿⣿⣿⠿⠷⠾⢽⣅⡲⠶⢻⣿⣼⢁⣠⣤⣶⣿⣿⠘⡇⠀⠀⢻⡆⠀⠀⠀⠀⠀⢀⣸⡀⢹⡇⠀⠈⡇⠀⠀⠀⠈⡇⠀⠀⢿⣿⣿⢹⣿⣤⣿⣿⣿⣿⡿⢿⣟⡀⠀⣿⣿⡇⠀⠀
⠀⠀⠀⠈⠛⠿⢯⣜⣿⠏⠀⠀⠀⢀⡿⣨⣿⣶⣤⣿⣷⣯⣿⣿⣿⣿⣿⠀⡇⠀⠀⠐⡿⣦⣰⣒⣶⣿⣿⣿⣷⣾⣇⠀⠀⢻⠀⠀⠀⠀⢷⠀⠀⢸⣿⣿⣾⣿⣸⣿⡏⢠⠟⣠⣿⣿⣿⣦⡈⢹⡇⠀⠀
⠀⠀⠀⠀⠀⠀⠀⢸⡟⣾⠄⠀⠀⣸⡇⣿⣿⣿⠟⠋⠛⢿⣿⣿⣿⣿⣿⡄⢻⠀⠀⠀⡇⠈⠙⣿⣿⣿⣿⣿⣿⣿⣿⠀⠀⢸⡆⠀⠀⠀⢸⡄⠀⠀⣿⣿⣇⣿⠛⠛⠻⣿⣺⣿⣿⣿⣿⣿⣿⡿⠃⠀⠀
⠀⠀⠀⠀⠀⠀⠀⣼⢧⡇⠀⠀⠀⣿⢸⣿⣿⡿⢦⣴⣿⣿⣷⡿⣿⡿⣿⡇⢸⡄⠀⠀⢹⠀⠀⣿⣿⣿⣿⣿⣿⣿⣿⡆⠀⠀⣇⠀⠀⠀⠀⣇⠀⠀⢸⣿⣟⢿⡀⠀⠀⠈⠉⠀⠉⠉⠉⠁⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⣿⣨⡧⠤⠤⢤⣇⡾⣿⣿⣠⣿⣿⣿⣿⣿⣿⣽⣿⣿⣷⠀⣇⠀⠀⢸⠀⠀⢸⢻⣿⣿⣿⣿⡇⣿⣿⠀⠀⢹⡄⠀⠀⢀⣸⠀⠀⠸⣿⣿⣼⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢀⡿⣧⣤⠶⠦⣼⣿⣿⣿⡏⠈⣿⣿⢿⣿⣿⣿⣏⠉⢹⣿⡀⢻⠀⠀⠘⡇⠀⠸⡄⠙⢿⣿⣿⠇⣿⣿⡄⠀⠈⠓⠒⠋⠉⠀⠀⠀⠀⢿⠹⣯⣇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⣸⣿⢃⡏⠀⠀⢻⣿⣿⣽⣿⣦⠘⣿⣿⣿⣿⣿⢻⣿⣾⣿⡇⠘⡇⠀⠀⣇⠀⠀⣇⠀⠀⠙⢿⡇⣿⢸⣧⠀⠀⠀⠀⡴⠒⢶⠀⠀⠀⠘⣆⠀⢻⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⡿⡅⣸⢁⣄⡄⣾⣿⢿⣿⠿⣿⣿⢻⣿⣿⣟⣿⣸⣻⡿⣿⣧⠀⠙⠒⠛⠛⠀⠀⢿⣿⣄⠀⠀⠀⣿⠈⣿⡄⠀⠀⠀⡇⠀⠘⡇⠀⠀⠀⢿⣦⢸⡆⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⢸⣧⡇⣿⣼⣿⠃⣿⣿⣾⣿⣷⣤⡿⠿⢿⣿⣿⣇⣿⡟⠋⠀⣿⡀⠀⣴⠲⡆⠀⠀⠸⣿⣿⣦⠀⠀⢸⡀⢹⣧⠀⠀⠀⣇⠀⠀⢹⠀⠀⠀⠸⣿⡟⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⢽⡿⣷⠏⠛⠿⢠⣿⣿⣿⣿⢿⣯⡇⠀⠀⠈⠁⠀⠀⠀⠀⠀⢸⣇⠀⢻⠀⢳⠀⠀⠀⣿⣿⣿⣷⣾⢸⡇⠈⣿⡀⠀⠀⢸⠀⠀⠈⡇⠀⠀⢀⣿⣿⣷⣀⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠘⣧⡙⣀⣀⣀⣸⣿⣽⣿⣿⠀⠈⠙⣶⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⡀⢸⡀⠸⡄⠀⠀⢻⣿⣿⣿⣿⡼⡇⠀⢘⣧⣤⡴⠾⠷⠶⠖⠛⠛⢛⠋⠉⢿⢹⠉⣭⡿⠿⠷⠶⢦⡄⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠹⣟⣁⣸⣿⣿⣧⡿⠿⣿⣀⡀⠀⡟⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣇⣈⣧⣘⣷⣤⣤⣼⠿⠿⣿⣿⣧⣧⡀⣸⢹⡏⠀⠀⠀⠀⠀⠀⠀⠈⡇⠀⢸⢸⡄⡿⠖⠚⠉⡉⠓⢿⡀⠀⠀⠀⠀
⠀⠀⠀⠀⣠⡴⣾⠋⠉⢙⣻⣷⠛⠛⠳⠶⠶⠽⠿⠃⠀⠀⠀⠀⠀⣀⡤⣼⡿⠋⠉⠁⠀⠀⣠⠀⣿⣿⠀⠀⠀⠀⠈⠉⠻⣿⢸⣷⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠸⡏⡇⣿⠀⠀⠀⢻⣷⢸⡇⠀⠀⠀⠀
⠀⠀⠀⠀⡟⠀⡟⠀⠀⢸⣿⣿⣇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⣾⡥⢺⠏⡆⠀⠀⠀⠀⠀⡏⠀⡟⡇⠀⠀⠀⠀⠀⠀⢀⡇⢸⣿⠀⠀⠀⠀⠀⠀⠀⠀⡇⠀⠀⡇⡇⢿⠀⠀⠀⢸⣿⡌⣷⠀⠀⠀⠀
⠀⠀⠀⢸⠇⢠⡇⠀⠀⢰⣿⣯⣏⣻⡆⠀⠀⠀⠀⠀⠀⠀⠀⣸⠃⢀⡿⢸⡇⠀⠀⠀⠀⢠⡇⠀⡇⡇⠀⠀⠀⠀⠀⠀⢸⡇⢸⣿⡆⠀⠀⠀⠀⠀⠀⠀⣧⠀⠀⡇⢿⢸⠀⠀⠀⠈⣿⡇⢹⡀⠀⠀⠀
⠀⠀⠀⡟⡄⣼⠀⠀⢀⣿⣿⣿⣿⣿⣷⠀⠀⠀⠀⠀⠀⠀⠀⣿⠀⢸⡇⣸⡇⠀⠀⠀⠀⢸⠁⢸⣷⡇⠀⠀⠀⠀⠀⠀⢸⡇⢸⣿⡇⠀⠀⠀⠀⠀⠀⠀⢻⠀⠀⢹⢸⣼⡀⠀⣀⣀⣿⣧⣸⡇⠀⠀⠀
⠀⠀⢰⢧⣇⡏⠀⠀⣸⣿⠿⢭⣿⣿⡏⠀⠀⠀⠀⠀⠀⠀⢰⡏⠀⣿⠀⣿⡇⠀⠀⠀⠀⢸⠀⢸⢸⠁⠀⠀⠀⠀⠀⠀⢸⡇⢸⣿⣿⠀⠀⠀⠀⠀⠀⠀⢸⠀⠀⢸⢸⣿⡏⢉⣁⣤⣤⣄⢈⡇⠀⠀⠀
⠀⠀⣼⢼⣿⠃⠀⠀⣿⣿⠀⢸⣿⣿⡇⠀⠀⠀⠀⠀⠀⠀⢸⡇⢠⡿⢰⣿⠃⠀⠀⠀⠀⣼⠀⢸⢸⠀⠀⠀⠀⠀⠀⠀⢸⡇⢸⢹⣸⣦⣤⣤⣤⣶⣶⣶⡿⠀⠀⢸⡄⡇⣧⣽⣿⣿⣿⡽⠟⠁⠀⠀⠀
⠀⠀⢿⢻⡏⠀⠀⢰⣿⣿⣟⠛⢿⣿⡇⠀⠀⠀⠀⠀⠀⠀⢸⠗⣻⡇⢸⢹⣆⣀⣀⣀⣤⡏⠀⢸⢸⠀⠀⠀⠀⠀⠀⠀⢸⡇⢸⠈⠉⠉⠉⠉⠉⠉⠀⠀⠀⠀⠀⠈⡇⣿⠘⣿⣿⣿⣇⠀⠀⠀⠀⠀⠀
⠀⠀⢸⠛⠤⢤⣤⣘⢺⣿⣿⣿⣿⡿⠃⠀⠀⠀⠀⠀⠀⠀⠸⢧⣿⠃⠘⠓⠛⠛⠛⠋⠉⠁⠀⢼⢸⠀⢰⡾⠿⠛⠛⠿⢿⡇⣇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇⢸⠀⠙⣿⣿⣿⠀⠀⠀⠀⠀⠀
⠀⠀⢘⣶⡶⠚⠿⢿⣿⣩⢿⢿⡏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⣸⠀⢸⡇⠀⠀⠀⠀⣿⡇⡏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢷⢸⡀⠀⠈⠁⢸⡇⠀⠀⠀⠀⠀
⠀⠀⣼⣹⠃⠀⢰⣷⢻⠁⠈⠛⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⡿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡟⣹⠀⢸⠃⠀⠀⠀⠀⣿⠇⠜⠀⣤⠶⠖⠛⠛⠋⠉⠉⢩⣿⡇⠀⢸⠸⡇⠀⠀⠀⠘⡇⠀⠀⠀⠀⠀
⠀⢠⡟⠏⠀⠀⣾⣿⣼⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣾⡇⠀⢀⣴⠶⠞⠛⠛⣻⣷⠀⡏⣿⠀⢸⢀⣴⣷⣦⡀⣿⠇⡇⠀⡟⠀⣀⣀⣀⣀⣀⣀⣸⣿⡇⠀⢸⡆⡇⠀⠀⠀⠀⣷⠀⠀⠀⠀⠀
⠀⣸⠇⠀⠀⢸⣿⡇⡟⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⡇⠀⣾⣀⣀⣤⣤⣶⣿⡿⠀⡇⣿⠀⢸⣿⣿⣿⣫⣾⣿⠀⡇⢠⣟⣿⣿⣿⡿⠿⠿⠿⠿⠁⡇⠀⠈⡇⣷⢀⡀⠀⠀⢻⠀⠀⠀⠀⠀
⠀⣿⡼⠀⠀⡟⣿⣷⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢰⣿⠀⢀⡟⡿⠿⠟⠛⠛⣃⡇⠀⡇⣿⠀⢸⣿⣿⣿⣿⣿⣿⡄⡇⢸⡇⠀⠀⠀⠀⠀⠀⠀⢰⣶⡇⠀⠀⣇⢹⣾⣿⠀⣰⢾⡆⠀⠀⠀⠀
⢠⣿⡇⠀⢸⣷⣿⣹⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣿⠀⢸⡇⠀⠀⠀⠀⢰⣿⡇⠀⡇⣿⠀⣾⣿⣿⣿⣿⣿⣿⡃⡇⢸⣧⣤⣤⣴⣶⣶⣶⣶⣾⣿⡇⠀⠀⢿⢸⣿⣿⣾⣿⣸⡇⠀⠀⠀⠀
⢸⢭⠥⠦⣬⣽⣧⡟⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣸⣿⠀⢸⢵⣶⣾⣿⣿⣿⡿⡇⠀⡇⣿⠀⣿⣿⣿⣿⣿⣿⣿⡇⡇⢸⡏⠿⠟⠛⠛⠛⠛⠛⠛⣧⣷⠀⠀⢸⠀⣿⣿⣿⣿⠛⣇⠀⠀⠀⠀
⢸⣸⠁⢠⣿⣿⣹⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⡇⠀⢸⠉⠉⠉⠁⠀⢠⣾⡇⠀⡇⣿⠀⣿⣿⣿⣿⣿⣿⣿⠇⡇⢸⡇⠀⣀⣀⣀⣀⣀⣀⣰⣿⣿⠀⠀⠸⠀⣿⣿⣿⣵⡇⣿⠀⠀⠀⠀
⠘⣧⣰⠞⣞⣷⠏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⡇⠀⢸⣀⣀⣠⣤⣤⣼⣿⡇⠀⡇⣿⠀⢈⣭⣭⠭⠽⠭⣿⡇⡇⢸⣟⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⡀⠀⠀⠀⢻⣟⣾⣿⣿⢻⠀⠀⠀⠀
⠀⠈⠛⠛⠛⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⡇⠀⣿⠿⠿⠿⠿⠟⢛⣻⡇⠀⡇⢻⠀⢸⠁⠀⠀⠀⠀⣿⡇⡇⠸⡏⠉⠀⠀⠀⠀⠀⠀⠀⣼⣿⡇⠀⠀⠀⢸⣿⣿⣿⣿⢸⡆⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⣇⠀⣿⣀⣤⣤⣤⣤⣼⣿⡇⠀⡇⢸⠀⢸⠀⣠⣶⣄⠀⣿⡇⣇⠀⡇⣴⣶⣶⣾⣿⣿⣿⣿⣿⣿⣇⣀⣂⠀⢸⣿⣿⣿⣿⣿⡇⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣿⡿⠴⠿⠿⠿⠿⠿⠿⠿⠿⠷⣦⡄⢸⠀⢸⣾⣿⣿⢟⣴⣿⣷⣼⠶⠗⠛⠛⠛⠛⠛⠛⠛⠋⠉⠉⠉⢉⡟⣧⠈⣿⣿⣿⣿⡿⣧⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣴⣿⠟⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⣴⣿⡇⢸⣴⢾⣿⡿⣻⣿⣿⣿⣿⠏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣿⣿⠀⣿⣿⣿⣿⣿⣿⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⣾⠟⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣼⣿⣿⡇⢸⣿⢸⣿⣿⣿⣿⣿⣿⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣼⣿⣿⡀⣿⣿⣿⣿⣿⣿⡄⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣰⣿⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣾⣿⣿⣿⡇⢸⣿⢸⣿⣿⣿⣿⣿⠃⠀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣀⣿⣿⣿⡇⢸⣿⣿⣿⣿⢻⡇⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣰⣿⣷⣶⣶⣶⣶⣶⣶⣶⡶⠶⠦⠤⣾⣿⣿⣿⣿⣷⢘⣿⢸⣿⣿⣿⣿⡏⣭⠭⠭⠭⠤⠤⠤⠴⠶⠶⠶⠶⠶⠶⠶⠱⣌⢻⣿⣧⢸⣿⣿⣿⣿⣾⣇⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣴⡾⠟⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⣉⣽⣿⣾⣿⣿⣿⣿⣿⠀⣿⢸⣿⣿⣿⡟⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⣿⡞⣿⢻⠈⣿⣿⣿⣿⣿⣿⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⣶⠟⠋⠀⠀⠀⠀⠀⠀⠀⠀⢀⣠⣴⣾⣿⣿⣿⣿⣿⣿⣿⠛⢹⠀⣿⣾⣿⣿⡟⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢰⣿⣿⣿⢻⣿⡀⣿⣿⣿⣿⣿⣿⡄⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⣾⣿⣀⣤⣄⣤⣤⣄⣀⣀⣀⣀⣶⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣅⢸⠀⣿⡿⣿⣿⣤⣤⣤⡤⠤⠤⠶⠶⠶⠖⠒⠒⠒⠚⠛⠛⠛⠺⣿⣿⣿⡇⠹⡇⣿⣿⣿⣿⣿⣿⡇⠀
⠀⠀⠀⠀⠀⠀⠀⠀⣸⣿⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢿⣿⣿⣿⣿⡟⠉⢹⣿⣿⣿⣿⡿⠿⡾⠀⣿⡇⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣿⣿⣿⠰⠇⣿⣿⣿⣿⡿⣿⡇⠀
⠀⠀⠀⠀⠀⠀⠀⠀⢿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⡟⠛⠉⠁⠀⠀⠀⠙⠛⠉⠁⠀⠀⠁⠀⣛⣁⣿⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⢿⠟⣹⡇⢀⣙⣿⣯⡷⠿⠛⠁⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠈⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠛⠉⠉⠉⠉⠉⠉⠹⠷⣦⣤⣤⣤⣤⣤⣤⣤⣤⣤⣶⣶⣶⡶⠶⠶⠶⠶⠾⠿⠛⠛⠋⠉⠉⠁⠀⠀
</pre>
  </div>
</div>

# **Setup**

Import all libraries needed

In [1]:
import os
import time
import copy
from datetime import datetime
from collections import Counter

import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence
from transformers import BertConfig, BertForMultipleChoice, BertTokenizer, AutoTokenizer, AutoModelForCausalLM, get_linear_schedule_with_warmup, GenerationConfig

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
import gensim

from datasets import load_dataset
from huggingface_hub import hf_hub_download

from tqdm import tqdm, trange
import wandb


import re
import string
import html
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import words as nltk_words

Download the necessary NLTK resources. I downloaded everything that I might need later.

In [2]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('words')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\fabia\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\fabia\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\fabia\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\fabia\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\fabia\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

Set up the random seed to ensure reproducibility.

_Info about the seed value: The field of natural language processing began in the 1940s, after World War II. At this time, people recognized the importance of translation from one language to another and hoped to create a machine that could do this sort of translation automatically._

In [3]:
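SEED = 1940 # the usual default would be 42

np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)  # also seed every GPU if more than one is visible
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False  # disable autotuning, which can pick different kernels per run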

In the next step I import and split the dataset. The last 1000 entries of the official train split are held out as the validation set, and the remaining entries are used for training. The official validation split serves as the test set, because the official test split has no answer keys.

In [4]:
train = load_dataset("tau/commonsense_qa", split="train[:-1000]")
valid = load_dataset("tau/commonsense_qa", split="train[-1000:]")
test = load_dataset("tau/commonsense_qa", split="validation")

print(len(train), len(valid), len(test))

8741 1000 1221
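
The slicing in the split strings behaves like ordinary Python list slicing. The following is only a small sketch of that behaviour: `official_train` is a stand-in list (an assumption for illustration) with 9741 entries, the size implied by the printed 8741 + 1000.

```python
# Stand-in for the official train split (9741 labelled examples, assumed from 8741 + 1000).
official_train = list(range(9741))

# Same slicing as "train[:-1000]" and "train[-1000:]" in the split strings.
train_part = official_train[:-1000]
valid_part = official_train[-1000:]

print(len(train_part), len(valid_part))

# The two parts do not overlap and together cover the whole original split.
assert set(train_part).isdisjoint(valid_part)
assert len(train_part) + len(valid_part) == len(official_train)
```

Because the held-out entries are simply the last 1000 rows, the split is deterministic and does not depend on the random seed.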


Log in to wandb for the experiment tracking.

In [None]:
wandb.login()

# **Data Exploration**

In [None]:
print("\033[4m" + "Dataset Features" + "\033[0m")
for feature in train.features:
    print(feature)
print("\n" + "\033[4m" + "Example" + "\033[0m")
for feature in train.features:
    print(feature + ":", train[0][str(feature)])

# **Preprocessing**

For the preprocessing I looked at the following points:

1. Tokenization
2. Lowercasing, stemming, lemmatizing, stopword/punctuation removal 
3. Removal of unknown/other words 
4. Format cleaning (e.g. html-extracted text) 
5. Truncation 
6. Feature selection 

**Reasoning why used**

In [None]:
tokenizer_bert = BertTokenizer.from_pretrained('bert-base-cased')

In [5]:
tokenizer_deepseek = AutoTokenizer.from_pretrained('deepseek-ai/DeepSeek-V2-Lite', trust_remote_code=True)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
def preprocess_commonsenseqa(examples):
    
    # Extract questions and choices
    questions = [q for q in examples['question']]
    
    # Initialize arrays for input_ids, attention_masks, and token_type_ids
    all_input_ids = []
    all_attention_mask = []
    all_token_type_ids = []
    
    # Get correct answer indices
    answerkeys = examples['answerKey']
    labels = []
    
    # Convert letter answers to indices (A->0, B->1, etc.)
    for key in answerkeys:
        labels.append(ord(key) - ord('A'))
    
    # Process each question with its choices
    for i, (question, choices) in enumerate(zip(questions, examples['choices'])):
        inputs = []
        
        # Process each choice for the current question
        for choice in choices['text']:
            # Combine question and choice
            text_a = question
            text_b = choice
            
            # Tokenize
            encoded = tokenizer_bert(
                text_a, text_b,
                add_special_tokens=True,
                max_length=128,
                padding='max_length',
                truncation=True,
                return_tensors='pt'
            )
            
            inputs.append({
                'input_ids': encoded['input_ids'],
                'attention_mask': encoded['attention_mask'],
                'token_type_ids': encoded['token_type_ids']
            })
        
        # Stack tensors for all choices of this question
        input_ids = torch.cat([x['input_ids'] for x in inputs])
        attention_mask = torch.cat([x['attention_mask'] for x in inputs])
        token_type_ids = torch.cat([x['token_type_ids'] for x in inputs])
        
        all_input_ids.append(input_ids)
        all_attention_mask.append(attention_mask)
        all_token_type_ids.append(token_type_ids)
    
    # Return lists of per-question tensors (they are stacked into tensors later)
    return {
        'input_ids': all_input_ids,
        'attention_mask': all_attention_mask,
        'token_type_ids': all_token_type_ids,
        'labels': labels
    }
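
To make the multiple-choice encoding more concrete, here is a minimal sketch without the tokenizer. The record is hypothetical (it only follows the dataset layout and is not taken from CommonsenseQA), and `to_pairs_and_label` is an illustrative helper, not part of the notebook.

```python
# Hypothetical record in the CommonsenseQA layout (not taken from the dataset).
example = {
    "question": "Where would you put a plate after washing it?",
    "choices": {
        "label": ["A", "B", "C", "D", "E"],
        "text": ["cupboard", "garden", "ocean", "mailbox", "stadium"],
    },
    "answerKey": "A",
}


def to_pairs_and_label(record):
    """Return the (question, choice) pairs and the 0-based index of the answer."""
    pairs = [(record["question"], choice) for choice in record["choices"]["text"]]
    label = ord(record["answerKey"]) - ord("A")  # A->0, B->1, ...
    return pairs, label


pairs, label = to_pairs_and_label(example)
print(len(pairs), label)
```

In the preprocessing function each of these five pairs is tokenized and padded to 128 tokens, so one question ends up as a tensor of shape (5, 128) for each of `input_ids`, `attention_mask`, and `token_type_ids`.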

In [None]:
# Apply preprocessing
train_dataset = preprocess_commonsenseqa(train)
validation_dataset = preprocess_commonsenseqa(valid)
test_dataset = preprocess_commonsenseqa(test)

In [None]:
# Convert to PyTorch datasets
train_features = TensorDataset(
    torch.stack(train_dataset['input_ids']),
    torch.stack(train_dataset['attention_mask']),
    torch.stack(train_dataset['token_type_ids']),
    torch.tensor(train_dataset['labels'])
)

val_features = TensorDataset(
    torch.stack(validation_dataset['input_ids']),
    torch.stack(validation_dataset['attention_mask']),
    torch.stack(validation_dataset['token_type_ids']),
    torch.tensor(validation_dataset['labels'])
)

test_features = TensorDataset(
    torch.stack(test_dataset['input_ids']),
    torch.stack(test_dataset['attention_mask']),
    torch.stack(test_dataset['token_type_ids']),
    torch.tensor(test_dataset['labels'])
)

In [6]:
def process_commonsense_qa_for_deepseek(dataset, model, tokenizer, num_examples=5, max_new_tokens=100, include_answer=False):
    results = {
        'questions': [],
        'prompts': [],
        'responses': [],
        'correct_answers': []
    }
    
    # Limit the dataset to the first num_examples
    if num_examples is not None:
        # Slice the dataset
        limited_dataset = dataset.select(range(min(num_examples, len(dataset))))
    else:
        limited_dataset = dataset
    
    # Check if question_concept is in the dataset
    has_question_concept = 'question_concept' in limited_dataset.column_names
    
    # Process each example
    for i in range(len(limited_dataset)):
        example = limited_dataset[i]  # Fetch one row instead of re-reading whole columns
        question = example['question']
        
        # Get question_concept if available
        question_concept = example['question_concept'] if has_question_concept else ''
        
        # Format the choices - each row holds a dictionary with 'label' and 'text' keys
        choices = example['choices']
        choice_labels = choices['label']  # Already a list
        choice_texts = choices['text']   # Already a list
        choices_text = ", ".join([f"{label}. {text}" for label, text in zip(choice_labels, choice_texts)])
        
        # Get answer key if available
        answer_key = example['answerKey'] if 'answerKey' in limited_dataset.column_names else None
        
        # Build the prompt
        prompt = f"Question: {question}\n"
        if question_concept:
            prompt += f"Question concept: {question_concept}\n"
        prompt += f"Choices: {choices_text}\n"
        
        if include_answer and answer_key:
            prompt += f"The correct answer is: {answer_key}"
        else:
            prompt += "The correct answer is:"
        
        # Tokenize and generate
        inputs = tokenizer(prompt, return_tensors="pt")
        inputs = {k: v.to(model.device) for k, v in inputs.items()}  # Move inputs to the model's device
        
        try:
            outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
            
            # Decode the result
            result = tokenizer.decode(outputs[0], skip_special_tokens=True)
        except Exception as e:
            print(f"Error generating response for example {i}: {e}")
            result = f"Error: {str(e)}"
        
        # Store the results
        results['questions'].append(question)
        results['prompts'].append(prompt)
        results['responses'].append(result)
        results['correct_answers'].append(answer_key if answer_key else "N/A")
    
    return results
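
The decoded response contains the prompt followed by the generated continuation, so the predicted letter still has to be extracted before it can be compared with `correct_answers`. One possible way is sketched below. `extract_answer_letter` is a hypothetical helper (not part of the notebook), and it assumes that the model answers with a standalone letter A to E right after the prompt's final line.

```python
import re

# Final line of the prompt built in process_commonsense_qa_for_deepseek.
MARKER = "The correct answer is:"


def extract_answer_letter(response):
    """Return the first standalone letter A-E after the marker, or None."""
    # Drop the prompt part; only the continuation after the first marker is searched.
    _, found, continuation = response.partition(MARKER)
    if not found:
        return None  # e.g. the "Error: ..." strings stored when generation failed
    match = re.search(r"\b([A-E])\b", continuation)
    return match.group(1) if match else None


# Hypothetical decoded response (prompt + continuation).
decoded = "Question: q\nChoices: A. x, B. y\nThe correct answer is: B. y"
print(extract_answer_letter(decoded))
```

This heuristic would misread a continuation that starts with the article "A", so it is probably worth checking a sample of responses by hand before trusting the resulting accuracy.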

# **Model**

In [None]:
config_bert = BertConfig.from_pretrained('bert-base-cased')
random_bert_model = BertForMultipleChoice(config_bert)

In [None]:
print(random_bert_model)

In [None]:
pretrained_bert_model = BertForMultipleChoice.from_pretrained('bert-base-cased')

In [None]:
print(pretrained_bert_model)

In [7]:
model_deepseek = AutoModelForCausalLM.from_pretrained('deepseek-ai/DeepSeek-V2-Lite', trust_remote_code=True, torch_dtype=torch.bfloat16, attn_implementation="eager")
model_deepseek.generation_config = GenerationConfig.from_pretrained('deepseek-ai/DeepSeek-V2-Lite')
model_deepseek.generation_config.pad_token_id = model_deepseek.generation_config.eos_token_id

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

# **Training**

Select the device, create the DataLoaders, and define the training loop that is shared by both BERT models.

In [8]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

Using device: cuda


In [None]:
# Create DataLoaders
batch_size = 32

train_dataloader = DataLoader(train_features, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_features, batch_size=batch_size)
test_dataloader = DataLoader(test_features, batch_size=batch_size)

In [None]:
def train_transformer(model, train_dataloader, val_dataloader, device, 
                      epochs=10, learning_rate=1e-4, warmup_steps=None,
                      log_wandb=True, gradient_clip_val=5.0, save_interval=1,
                      gradient_accumulation_steps=4, patience=3,
                      weight_decay=0.001, save_path=None):
        
    # Handle save path for checkpoints
    if save_path:
        # Make sure parent directory exists
        checkpoint_dir = save_path
        os.makedirs(checkpoint_dir, exist_ok=True)
        
        # Define best model path inside the checkpoint directory
        best_model_path = os.path.join(checkpoint_dir, "best_transformer_model.pt")
    else:
        print("Warning: No save_path provided. Model and checkpoints will not be saved.")
        checkpoint_dir = None
        best_model_path = None
    
    # Move model to device
    model = model.to(device)
    
    # Enable gradient checkpointing for memory efficiency
    model.gradient_checkpointing_enable()
    
    # Initialize wandb logging if enabled
    if log_wandb:
        import wandb
        # Check if wandb is initialized, if not initialize it
        if not wandb.run:
            wandb.init(
                project="CommonsenseQA",
                # Change the name to the current transformer model
                name=f"pretrained_transformer-{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}",
                config={
                    "learning_rate": learning_rate,
                    "epochs": epochs,
                    "batch_size": train_dataloader.batch_size,
                    "gradient_accumulation_steps": gradient_accumulation_steps,
                    "effective_batch_size": train_dataloader.batch_size * gradient_accumulation_steps,
                    "weight_decay": weight_decay,
                    "warmup_steps": warmup_steps,
                    "gradient_clip_val": gradient_clip_val,
                    "model_type": model.__class__.__name__,
                      })
    
    # Initialize CrossEntropy Loss
    criterion = torch.nn.CrossEntropyLoss()
    
    # Initialize optimizer
    optimizer = optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=weight_decay)
    
    # Calculate total training steps
    total_steps = len(train_dataloader) * epochs // gradient_accumulation_steps
    
    # Set default warmup steps if not provided (counted in optimizer steps, like total_steps)
    if warmup_steps is None:
        warmup_steps = len(train_dataloader) // gradient_accumulation_steps  # One epoch of warmup
    
    # Initialize scheduler
    scheduler = get_linear_schedule_with_warmup(
        optimizer, 
        num_warmup_steps=warmup_steps, 
        num_training_steps=total_steps
    )
    
    # Lists to store metrics
    train_losses = []
    val_accuracies = []
    
    # Variables to track best model
    best_accuracy = 0.0
    early_stopping_counter = 0
    
    # Training loop
    for epoch in range(epochs):
        print(f"\nEpoch {epoch+1}/{epochs}")
        
        # Training phase
        model.train()
        epoch_loss = 0.0
        epoch_correct = 0
        epoch_total = 0
        progress_bar = tqdm(train_dataloader, desc="Training")
        optimizer.zero_grad()  # Zero gradients once at the beginning of epoch
        
        for i, batch in enumerate(progress_bar):
            # Move batch to device
            input_ids, attention_mask, token_type_ids, labels = [b.to(device) for b in batch]
            
            # Forward pass
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids
            )
            
            # Compute the cross-entropy loss with the criterion created before the loop
            logits = outputs.logits if hasattr(outputs, 'logits') else outputs
            loss = criterion(logits, labels)
            
            # Calculate training accuracy
            _, preds = torch.max(logits, dim=1)
            epoch_correct += (preds == labels).sum().item()
            epoch_total += labels.size(0)
            
            # Scale loss for gradient accumulation
            loss_to_backward = loss / gradient_accumulation_steps
            
            # Backward pass
            loss_to_backward.backward()
            
            # Update weights every gradient_accumulation_steps batches
            if (i + 1) % gradient_accumulation_steps == 0 or (i + 1) == len(train_dataloader):
                # Clip gradients using the provided gradient_clip_val parameter
                torch.nn.utils.clip_grad_norm_(model.parameters(), gradient_clip_val)
                
                # Update weights
                optimizer.step()
                scheduler.step()
                optimizer.zero_grad()
                
                # Log learning rate
                if log_wandb:
                    wandb.log({"learning_rate": scheduler.get_last_lr()[0]})
            
            # Update progress bar (use the unscaled loss for display)
            progress_bar.set_postfix({"loss": loss.item()})
            
            # Accumulate loss (use the unscaled loss for logging)
            epoch_loss += loss.item()
        
        # Calculate average loss and accuracy for the epoch
        avg_train_loss = epoch_loss / len(train_dataloader)
        train_accuracy = epoch_correct / epoch_total
        train_losses.append(avg_train_loss)
        print(f"Training loss: {avg_train_loss:.4f}, accuracy: {train_accuracy:.4f}")
        
        # Validation phase
        model.eval()
        correct = 0
        total = 0
        val_loss = 0.0
        
        # No gradient computation for validation
        with torch.no_grad():
            progress_bar = tqdm(val_dataloader, desc="Validation")
            
            for batch in progress_bar:
                # Move batch to device
                input_ids, attention_mask, token_type_ids, labels = [b.to(device) for b in batch]
                
                # Forward pass
                outputs = model(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    token_type_ids=token_type_ids
                )
                logits = outputs.logits if hasattr(outputs, 'logits') else outputs
                
                # Compute the cross-entropy loss with the criterion created before the loop
                loss = criterion(logits, labels)
                
                val_loss += loss.item()
                
                # Get predictions
                _, preds = torch.max(logits, dim=1)
                
                # Calculate accuracy
                correct += (preds == labels).sum().item()
                total += labels.size(0)
                
                # Update progress bar
                progress_bar.set_postfix({"acc": correct/total})
            
        # Calculate validation metrics
        val_accuracy = correct / total
        avg_val_loss = val_loss / len(val_dataloader)
        val_accuracies.append(val_accuracy)
        print(f"Validation loss: {avg_val_loss:.4f}, accuracy: {val_accuracy:.4f}")
        
        # Log metrics to wandb
        if log_wandb:
            wandb.log({
                "epoch": epoch,
                "train_loss": avg_train_loss,
                "train_accuracy": train_accuracy,
                "val_loss": avg_val_loss,
                "val_accuracy": val_accuracy,
                "learning_rate": scheduler.get_last_lr()[0]
            })
        
        # Save checkpoint at specified interval
        if checkpoint_dir and (epoch + 1) % save_interval == 0:
            checkpoint_path = os.path.join(checkpoint_dir, f"checkpoint_epoch_{epoch+1}.pt")
            model_to_save = model.module if hasattr(model, 'module') else model
            
            # Create checkpoint with additional information
            checkpoint = {
                'epoch': epoch + 1,
                'model_state_dict': model_to_save.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'scheduler_state_dict': scheduler.state_dict(),
                'best_accuracy': best_accuracy,
                'train_losses': train_losses,
                'val_accuracies': val_accuracies
            }
            torch.save(checkpoint, checkpoint_path)
            print(f"Checkpoint saved to {checkpoint_path}")
        
        # Track the best accuracy even when nothing is saved, so that early stopping still works
        if val_accuracy > best_accuracy:
            best_accuracy = val_accuracy
            
            if best_model_path:
                # Save model
                model_to_save = model.module if hasattr(model, 'module') else model
                torch.save(model_to_save.state_dict(), best_model_path)
                print(f"Best model saved to {best_model_path}")
                
                # Also save as checkpoint with additional metadata
                best_checkpoint_path = os.path.join(checkpoint_dir, f"best_checkpoint_epoch_{epoch+1}.pt")
                checkpoint = {
                    'epoch': epoch + 1,
                    'model_state_dict': model_to_save.state_dict(),
                    'optimizer_state_dict': optimizer.state_dict(),
                    'scheduler_state_dict': scheduler.state_dict(),
                    'best_accuracy': best_accuracy,
                    'train_losses': train_losses,
                    'val_accuracies': val_accuracies
                }
                torch.save(checkpoint, best_checkpoint_path)
            
            # Log best model to wandb
            if log_wandb:
                wandb.run.summary["best_accuracy"] = best_accuracy
                wandb.run.summary["best_epoch"] = epoch + 1
            
            # Reset early stopping counter
            early_stopping_counter = 0
        else:
            # Increment early stopping counter
            early_stopping_counter += 1
            print(f"No improvement for {early_stopping_counter} epochs")
            
            # Check if we should stop early
            if early_stopping_counter >= patience:
                print(f"Early stopping after {epoch+1} epochs")
                if log_wandb:
                    wandb.run.summary["stopped_epoch"] = epoch + 1
                break
    
    # Load best model if it was saved
    if best_model_path and os.path.exists(best_model_path):
        model.load_state_dict(torch.load(best_model_path, map_location=device))
        print(f"Loaded best model from {best_model_path}")
    
    # Finish wandb run
    if log_wandb:
        wandb.finish()
    
    return model, train_losses, val_accuracies
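
The early-stopping rule used in the loop can be isolated into a few lines. This is only a sketch of the counting logic: `early_stopping_epoch` is a hypothetical helper and the accuracy history is made up for illustration.

```python
def early_stopping_epoch(val_accuracies, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = 0.0
    epochs_without_improvement = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc > best:
            # Strict improvement resets the counter, as in the training loop.
            best = acc
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation accuracies, one value per epoch.
history = [0.20, 0.35, 0.41, 0.40, 0.41, 0.39, 0.45]
print(early_stopping_epoch(history, patience=3))
print(early_stopping_epoch(history, patience=5))
```

Note that an epoch which only ties the best accuracy counts as "no improvement", so with a small patience a run can stop shortly before a later gain, as the made-up history shows.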

In [None]:
random_model, random_losses, random_accuracies = train_transformer(
    random_bert_model, 
    train_dataloader, 
    val_dataloader, 
    device,
    epochs=20,
    learning_rate=1e-5,
    warmup_steps=int(0.1 * len(train_dataloader)),  # 10% of the batches in one epoch
    gradient_clip_val=5.0,
    save_interval=1,
    gradient_accumulation_steps=4,  # Effectively creates a batch size of 128 (batch_size=32 * 4)
    patience=3,
    weight_decay=0.01,
    save_path=f"./checkpoints/random_transformer-{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}"
)
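
The relation between batches, gradient accumulation, and optimizer steps can be checked with a little arithmetic. The sketch below only uses numbers that appear in this notebook (8741 training examples, batch size 32, accumulation over 4 batches, 20 epochs) and assumes the default `drop_last=False` of the DataLoader.

```python
import math

num_examples = 8741  # size of the training split printed earlier
batch_size = 32
accumulation = 4
epochs = 20

batches_per_epoch = math.ceil(num_examples / batch_size)
# The loop also steps on the last batch of an epoch, hence the ceiling.
updates_per_epoch = math.ceil(batches_per_epoch / accumulation)

# Total steps as computed in train_transformer versus the steps actually taken.
scheduler_total = batches_per_epoch * epochs // accumulation
actual_total = updates_per_epoch * epochs

print(batches_per_epoch, updates_per_epoch, scheduler_total, actual_total)
```

Under these assumptions the scheduler is configured for slightly fewer steps than a full 20-epoch run takes, and the last update of each epoch accumulates only two batches while the loss is still divided by 4. Both effects are small, but they mean the "effective batch size of 128" is an approximation.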

In [None]:
pretrained_model, pretrained_losses, pretrained_accuracies = train_transformer(
    pretrained_bert_model, 
    train_dataloader, 
    val_dataloader, 
    device,
    epochs=20,
    learning_rate=1e-5,
    warmup_steps=int(0.1 * len(train_dataloader)),  # 10% of the batches in one epoch
    gradient_clip_val=5.0,
    save_interval=1,
    gradient_accumulation_steps=4,  # Effectively creates a batch size of 128 (batch_size=32 * 4)
    patience=5,
    weight_decay=0.01,
    save_path=f"./checkpoints/pretrained_transformer-{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}"
)
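
For intuition about the learning-rate schedule, the shape of a linear warmup followed by a linear decay can be re-implemented in plain Python. This is a sketch of the schedule shape as I understand it, not the library code itself. The values 27 (roughly 10% of the 274 batches per epoch) and 1370 (the total steps computed in `train_transformer` for 20 epochs) are derived from the settings in this notebook.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear increase from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Linear decrease from 1 to 0 over the remaining steps, never below 0.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 1e-5
for step in (0, 14, 27, 700, 1370):
    factor = linear_warmup_factor(step, warmup_steps=27, total_steps=1370)
    print(step, base_lr * factor)
```

With a warmup this short the peak learning rate of 1e-5 is reached within the first epoch, and almost the whole run is spent in the decay phase.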

In [None]:
import copy
import os
from datetime import datetime

import torch
import wandb
from torch.utils.data import DataLoader

def run_sweep_trial(base_model, checkpoint_prefix, config=None):
    """Train a fresh copy of base_model with the hyperparameters of one sweep trial"""
    # Initialize a new wandb run for this trial
    with wandb.init(config=config):
        # Get the configuration chosen for this run
        config = wandb.config
        
        # Create dataloaders with the batch size from config
        train_dataloader = DataLoader(
            train_features,
            batch_size=config.batch_size,
            shuffle=True,
            num_workers=4,
            pin_memory=True  # Faster transfers to GPU
        )
        
        val_dataloader = DataLoader(
            val_features,
            batch_size=config.batch_size,
            shuffle=False,
            num_workers=4,
            pin_memory=True
        )
        
        # Copy the model so that every trial starts from the same initial weights
        # instead of continuing from the weights left behind by the previous trial
        model = copy.deepcopy(base_model)
        
        model, train_losses, val_accuracies = train_transformer(
            model=model,
            train_dataloader=train_dataloader,
            val_dataloader=val_dataloader,
            device=device,
            epochs=config.epochs,
            learning_rate=config.learning_rate,
            warmup_steps=int(config.warmup_ratio * len(train_dataloader)),
            gradient_clip_val=config.gradient_clip_val,
            save_interval=1,
            gradient_accumulation_steps=config.gradient_accumulation_steps,
            patience=config.patience,
            weight_decay=config.weight_decay,
            save_path=f"./checkpoints/{checkpoint_prefix}-sweep-{wandb.run.id}-{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}"
        )
        
        # Best validation accuracy of this trial. wandb.agent ignores the return value;
        # the sweep optimizes the 'val_accuracy' metric logged during training.
        return max(val_accuracies)

def objective_function_pretrained(config=None):
    """Objective function for hyperparameter optimization of the pretrained model"""
    return run_sweep_trial(pretrained_bert_model, "pretrained", config)

def objective_function_random(config=None):
    """Objective function for hyperparameter optimization of the randomly initialized model"""
    return run_sweep_trial(random_bert_model, "random", config)

def run_sweep_pretrained(count=20):
    """Run hyperparameter sweep for the pretrained model"""
    # Define the parameter search space
    sweep_config = {
        'method': 'bayes',
        'metric': {'name': 'val_accuracy', 'goal': 'maximize'},
        'name': 'pretrained_model_sweep',
        'parameters': {
            'batch_size': {
                'values': [8, 16, 32, 64, 128]
            },
            'learning_rate': {
                'distribution': 'log_uniform_values',  # min/max are the values themselves, not exponents
                'min': 1e-6,
                'max': 1e-3
            },
            'weight_decay': {
                'distribution': 'log_uniform_values',  # min/max are the values themselves, not exponents
                'min': 1e-5,
                'max': 1e-1
            },
            'gradient_clip_val': {
                'distribution': 'uniform',
                'min': 1.0,
                'max': 10.0
            },
            'gradient_accumulation_steps': {
                'values': [1, 2, 4, 8]
            },
            'warmup_ratio': {
                'distribution': 'uniform',
                'min': 0.0,
                'max': 0.5
            },
            'epochs': {
                'values': [50]
            },
            'patience': {
                'values': [5, 10]
            }
        }
    }
    
    # Create the sweep
    sweep_id = wandb.sweep(sweep_config, project="CommonsenseQA")
    
    # Run the sweep
    wandb.agent(sweep_id, function=objective_function_pretrained, count=count)

def run_sweep_random(count=20):
    """Run hyperparameter sweep for the randomly initialized model"""
    # Define the parameter search space
    # For random initialization, we might want to explore different learning rates
    sweep_config = {
        'method': 'bayes',
        'metric': {'name': 'val_accuracy', 'goal': 'maximize'},
        'name': 'random_model_sweep',
        'parameters': {
            'batch_size': {
                'values': [8, 16, 32, 64, 128]
            },
            'learning_rate': {
                'distribution': 'log_uniform_values',  # min/max are the values themselves, not exponents
                'min': 1e-6,
                'max': 1e-3
            },
            'weight_decay': {
                'distribution': 'log_uniform_values',  # min/max are the values themselves, not exponents
                'min': 1e-5,
                'max': 1e-1
            },
            'gradient_clip_val': {
                'distribution': 'uniform',
                'min': 1.0,
                'max': 10.0
            },
            'gradient_accumulation_steps': {
                'values': [1, 2, 4, 8]
            },
            'warmup_ratio': {
                'distribution': 'uniform',
                'min': 0.0,
                'max': 0.5
            },
            'epochs': {
                'values': [50]
            },
            'patience': {
                'values': [5, 10]
            }
        }
    }
    
    # Create the sweep
    sweep_id = wandb.sweep(sweep_config, project="CommonsenseQA")
    
    # Run the sweep
    wandb.agent(sweep_id, function=objective_function_random, count=count)

if __name__ == "__main__":
    # Run pretrained model sweep first
    print("Starting sweep for pretrained model...")
    run_sweep_pretrained(count=50)
    
    # Then run random model sweep
    print("Starting sweep for randomly initialized model...")
    run_sweep_random(count=50)
    
    # Alternatively, you can run them separately:
    # To run only the pretrained model sweep:
    # run_sweep_pretrained(count=20)
    
    # To run only the random model sweep:
    # run_sweep_random(count=20)
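
The sweep configurations above search `learning_rate` and `weight_decay` across several orders of magnitude, which is why a logarithmic scale is used for them. The sketch below illustrates what sampling on a log scale means: the logarithm of the value is drawn uniformly, so each decade receives roughly the same share of trials. It is a plain-Python illustration with a hypothetical helper `sample_log_uniform`, not W&B's own sampler.

```python
import math
import random
from collections import Counter


def sample_log_uniform(low, high, rng):
    """Draw x between low and high so that log(x) is uniformly distributed."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed so the illustration is repeatable
samples = [sample_log_uniform(1e-6, 1e-3, rng) for _ in range(10_000)]

# Count how many samples fall into each decade (1e-6..1e-5, 1e-5..1e-4, 1e-4..1e-3)
decade_counts = Counter(math.floor(math.log10(x)) for x in samples)
for decade in sorted(decade_counts):
    share = decade_counts[decade] / len(samples)
    print(f"1e{decade} to 1e{decade + 1}: {share:.1%} of samples")
```

A plain uniform distribution over the same range would put about 90% of the samples between 1e-4 and 1e-3 and almost never try the small learning rates.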

In [None]:
def plot_training_curves(random_losses, random_accuracies, pretrained_losses, pretrained_accuracies):
    plt.figure(figsize=(12, 5))
    
    # Plot training loss
    plt.subplot(1, 2, 1)
    plt.plot(random_losses, label='Random Init')
    plt.plot(pretrained_losses, label='Pretrained')
    plt.title('Training Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    
    # Plot validation accuracy
    plt.subplot(1, 2, 2)
    plt.plot(random_accuracies, label='Random Init')
    plt.plot(pretrained_accuracies, label='Pretrained')
    plt.title('Validation Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()
    
    plt.tight_layout()
    plt.show()

In [None]:
plot_training_curves(random_losses, random_accuracies, pretrained_losses, pretrained_accuracies)

# **Evaluation**

Important: report final results on the test split only, not on the validation split (and never on the training split).

In [None]:
def evaluate(model, dataloader, device):
    model.eval()
    true_labels = []
    predicted_labels = []
    all_logits = []
    
    with torch.no_grad():
        progress_bar = tqdm(dataloader, desc="Evaluation")
        
        for batch in progress_bar:
            # Move batch to device
            input_ids, attention_mask, token_type_ids, labels = [b.to(device) for b in batch]
            
            # Forward pass
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                token_type_ids=token_type_ids
            )
            
            # Get predictions
            logits = outputs.logits
            _, preds = torch.max(logits, dim=1)
            
            # Collect labels and predictions
            true_labels.extend(labels.cpu().numpy())
            predicted_labels.extend(preds.cpu().numpy())
            all_logits.append(logits.cpu().numpy())
    
    # Combine all logits
    all_logits = np.vstack(all_logits) if all_logits else np.array([])
    
    # Compute metrics
    from sklearn.metrics import (
        accuracy_score, 
        precision_score, 
        recall_score, 
        f1_score, 
        confusion_matrix,
        classification_report
    )
    
    accuracy = accuracy_score(true_labels, predicted_labels)
    
    # Weighted averages over the answer classes; zero_division=0 scores a class that is
    # never predicted as 0 instead of emitting an UndefinedMetricWarning
    precision = precision_score(true_labels, predicted_labels, average='weighted', zero_division=0)
    recall = recall_score(true_labels, predicted_labels, average='weighted', zero_division=0)
    f1 = f1_score(true_labels, predicted_labels, average='weighted', zero_division=0)
    
    # Confusion matrix (fixed 5x5 layout, even if a class never occurs in this split)
    cm = confusion_matrix(true_labels, predicted_labels, labels=list(range(5)))
    
    # Class-wise report
    class_report = classification_report(true_labels, predicted_labels, output_dict=True, zero_division=0)
    
    # Create a list to map index to answer choice label (A-E)
    idx_to_label = {i: chr(65 + i) for i in range(5)}  # 0->A, 1->B, etc.
    
    # Calculate per-class accuracy
    class_accuracies = {}
    for i in range(5):
        class_indices = np.where(np.array(true_labels) == i)[0]
        if len(class_indices) > 0:
            class_correct = sum([predicted_labels[j] == i for j in class_indices])
            class_accuracies[idx_to_label[i]] = class_correct / len(class_indices)
        else:
            class_accuracies[idx_to_label[i]] = 0
    
    results = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'confusion_matrix': cm,
        'class_report': class_report,
        'per_class_accuracy': class_accuracies,
        'logits': all_logits
    }
    
    # Print summary
    print(f"Overall Accuracy: {accuracy:.4f}")
    print("Per-class accuracy:")
    for label, acc in class_accuracies.items():
        print(f"  Choice {label}: {acc:.4f}")
    
    return results

In [None]:
random_model_results = evaluate(random_model, test_dataloader, device)

In [None]:
pretrained_model_results = evaluate(pretrained_model, test_dataloader, device)
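
Once both result dictionaries exist, a side-by-side summary makes the comparison easier to read. The sketch below is one possible way to do that, using a hypothetical helper `summarize_results` and the `accuracy` and `f1_score` keys returned by `evaluate`. It also reports the gain over the 20% chance level of a five-choice task. The numbers in the example are placeholders for illustration, not measured results.

```python
def summarize_results(results_by_model, num_choices=5):
    """Build one summary row per model from evaluate()-style result dicts."""
    chance = 1.0 / num_choices  # accuracy of guessing uniformly at random
    rows = []
    for name, results in results_by_model.items():
        accuracy = results["accuracy"]
        rows.append({
            "model": name,
            "accuracy": accuracy,
            "f1_score": results.get("f1_score"),
            "gain_over_chance": accuracy - chance,
        })
    # Best model first
    return sorted(rows, key=lambda row: row["accuracy"], reverse=True)


# Placeholder values for illustration only, NOT measured results
example_results = {
    "random_init": {"accuracy": 0.21, "f1_score": 0.20},
    "pretrained": {"accuracy": 0.55, "f1_score": 0.54},
}

for row in summarize_results(example_results):
    f1 = "n/a" if row["f1_score"] is None else f"{row['f1_score']:.3f}"
    print(
        f"{row['model']:<12} acc={row['accuracy']:.3f} "
        f"f1={f1} gain_over_chance={row['gain_over_chance']:+.3f}"
    )
```

In the notebook, the call would presumably look like `summarize_results({"random_init": random_model_results, "pretrained": pretrained_model_results})`.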

In [None]:
# For the validation set (1 example)
results = process_commonsense_qa_for_deepseek(valid, model_deepseek, tokenizer_deepseek, num_examples=1)

In [10]:
print(results['questions'])
print(results['prompts'])
print(results['responses'])
print(results['correct_answers'])

['What is a well known way for couples  of celebrating a marriage?']
['Question: What is a well known way for couples  of celebrating a marriage?\nQuestion concept: celebrating\nChoices: A. eat cake, B. getting drunk, C. having sex, D. cleaning rooms, E. drink too much\nThe correct answer is:']
['Question: What is a well known way for couples  of celebrating a marriage?\nQuestion concept: celebrating\nChoices: A. eat cake, B. getting drunk, C. having sex, D. cleaning rooms, E. drink too much\nThe correct answer is: C. having sex\nThe correct answer is: C. having sex\nThe correct answer is: C. having sex\nThe correct answer is: C. having sex\nThe correct answer is: C. having sex\nThe correct answer is: C. having sex\nThe correct answer is: C. having sex\nThe correct answer is: C. having sex\nThe correct answer is: C. having sex\nThe correct answer is: C. having sex\nThe correct answer is:']
['C']
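
The response above echoes the prompt and then repeats the answer line until the generation limit is reached, so it cannot be compared to the correct answer directly. The sketch below shows one way to reduce such a response to a single choice letter: strip the echoed prompt and accept only a letter A to E at the very start of the completion. The helper `extract_answer_letter` is a hypothetical addition and not part of `process_commonsense_qa_for_deepseek`.

```python
import re


def extract_answer_letter(prompt, response, choices="ABCDE"):
    """Return the choice letter the model produced right after the prompt, or None."""
    # The model output starts with the prompt itself, so cut it off first
    completion = response[len(prompt):] if response.startswith(prompt) else response
    # Accept an optional "(" and then one standalone choice letter at the start
    pattern = re.compile(rf"\s*\(?([{re.escape(choices)}])\b")
    match = pattern.match(completion)
    return match.group(1) if match else None


prompt = (
    "Question: What is a well known way for couples  of celebrating a marriage?\n"
    "Question concept: celebrating\n"
    "Choices: A. eat cake, B. getting drunk, C. having sex, D. cleaning rooms, E. drink too much\n"
    "The correct answer is:"
)
response = prompt + " C. having sex\nThe correct answer is: C. having sex\nThe correct answer is:"

predicted = extract_answer_letter(prompt, response)
print(predicted, predicted == "C")
```

Returning `None` for unparseable completions keeps them countable as a separate error category instead of silently scoring them as a wrong letter.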


# **Interpretation**

# **Tools used**

### **Adjust this section before submitting**

1. **Programming Environment**
   - Python 3.12.8
   - Jupyter Notebook

2. **Machine Learning and Deep Learning**
   - PyTorch (neural network development)
   - Hugging Face Datasets (data management)
   - NLTK (natural language preprocessing)
   - FastText (pre-trained word embeddings, 300-dimensional vectors)

3. **Data Manipulation and Analysis**
   - NumPy (numerical computing)
   - Pandas (data structuring and manipulation)
   - Scikit-learn (evaluation metrics: accuracy, precision, recall, F1, confusion matrix)

4. **Visualization and Tracking**
   - Matplotlib (basic plotting)
   - Seaborn (statistical data visualization)
   - Weights & Biases (experiment tracking and logging)
     * Tracked metrics: training loss, accuracy, learning rates
     * Logged hyperparameter configurations
     * Enabled comparative analysis across model runs

5. **Computational Infrastructure**
   - CUDA-enabled GPU acceleration
   - GPU-optimized PyTorch operations
   - Efficient parallel computing for model training

6. **Dataset and Benchmarking**
   - CommonsenseQA dataset (Hugging Face)
   - Standard benchmark for commonsense reasoning tasks

7. **Additional Libraries**
   - Gensim (word vector processing)
   - tqdm (progress bar visualization)
   - datetime (experiment timestamping)

8. **AI Tools**
   - Claude 3.5 Sonnet: Used as a coding assistant for debugging, optimization, and documentation.
   - GPT-4-turbo: Assisted in drafting and refining documentation, helping with structure and phrasing.
   - Copilot: Used for quick inline completions when its suggestion matched what I was planning to write.

9. **Sources**
   - Transformer architecture: https://medium.com/data-science/build-your-own-transformer-from-scratch-using-pytorch-84c850470dcb
   - Deepseek implementation: https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct