# Hyper Parameter search

In [1]:
from src.qwen import load_qwen
model_qwen, tokenizer = load_qwen()

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


In [1]:
import torch
import torch.nn as nn

In [2]:
from src.set_up_lora import*
from src.preprocessor import*

  from .autonotebook import tqdm as notebook_tqdm
Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


Hyper Parameters that we want to search for:
- $r = (2,4,8)$ "rank"
- $lr = (10^{-5}, 5 \times 10^{-5}, 10^{4})$ "learning rate"

The nested loop below will be very expensive in terms of computation, this will load Qwen2.5 nine times, if your local machine struggles to reload Qwen2.5 that many times, use the alternative code below.

In [4]:
from src.qwen import load_qwen

In [5]:

results = []
ranks = [2, 4, 8]
lrs = [1e-5, 5e-5, 1e-4]

for r in ranks:
    for lr in lrs:
        print(f"Training with r={r}, lr={lr}")
        model, tokenizer = load_qwen()
        trained_model, final_loss = train_lora_model(model, tokenizer, lora_rank=r, learning_rate=lr, train_steps=500)
        results.append({"rank": r, "learning_rate": lr, "final_loss": final_loss})
        print(f"-> final loss: {final_loss:.4f}")


Training with r=2, lr=1e-05


Steps 0:  44%|████▎     | 499/1142 [02:04<02:40,  4.01it/s]


-> final loss: 1.2310
Training with r=2, lr=5e-05


Steps 0:  44%|████▎     | 499/1142 [02:12<02:50,  3.76it/s]


-> final loss: 0.9345
Training with r=2, lr=0.0001


Steps 0:  44%|████▎     | 499/1142 [02:01<02:36,  4.10it/s]


-> final loss: 1.0758
Training with r=4, lr=1e-05


Steps 0:  44%|████▎     | 499/1142 [02:02<02:38,  4.06it/s]


-> final loss: 1.0886
Training with r=4, lr=5e-05


Steps 0:  44%|████▎     | 499/1142 [02:12<02:51,  3.76it/s]


-> final loss: 0.6507
Training with r=4, lr=0.0001


Steps 0:  44%|████▎     | 499/1142 [02:10<02:48,  3.82it/s]


-> final loss: 0.8168
Training with r=8, lr=1e-05


Steps 0:  44%|████▎     | 499/1142 [02:12<02:50,  3.76it/s]


-> final loss: 1.1950
Training with r=8, lr=5e-05


Steps 0:  44%|████▎     | 499/1142 [02:10<02:48,  3.82it/s]


-> final loss: 1.0317
Training with r=8, lr=0.0001


Steps 0:  44%|████▎     | 499/1142 [02:09<02:47,  3.84it/s]

-> final loss: 1.0108





In [6]:
import pandas as pd
HP_search_results_df = pd.DataFrame(results)
print(HP_search_results_df)


   rank  learning_rate  final_loss
0     2        0.00001    1.230973
1     2        0.00005    0.934489
2     2        0.00010    1.075752
3     4        0.00001    1.088630
4     4        0.00005    0.650662
5     4        0.00010    0.816825
6     8        0.00001    1.194976
7     8        0.00005    1.031744
8     8        0.00010    1.010849


### Alternative (Use only if the code above keeps crashing the kernel)

In [7]:
import gc
import torch
from src.qwen import load_qwen
from src.set_up_lora import*

In [8]:
_,tokenizer = load_qwen()

In [9]:
_, val_texts, _ = load_and_preprocess("data/lotka_volterra_data.h5")

In [10]:

results = []

ranks = [2, 4, 8]
lrs = [1e-5, 5e-5, 1e-4]

for r in ranks:
    for lr in lrs:
        print(f"\nTraining with r={r}, lr={lr}")

        # Load fresh model
        model, _ = load_qwen()
        trained_model, final_loss = train_lora_model(model, tokenizer, lora_rank=r, learning_rate=lr, train_steps=500)

        val_loss, _ = evaluate_loss_perplexity_val(trained_model, tokenizer, val_texts, 4)

        results.append({"rank": r, "learning_rate": lr, "Train Loss": final_loss, "Validation Loss": val_loss})
        print(f"-> Train loss: {final_loss:.4f}")
        print(f"-> Validation loss: {val_loss:.4f}")

        # Clean up to free GPU memory
        del model
        del trained_model
        torch.cuda.empty_cache()
        gc.collect()



Training with r=2, lr=1e-05


Steps 0:  44%|████▎     | 499/1142 [02:09<02:47,  3.84it/s]
Validating: 100%|██████████| 75/75 [00:09<00:00,  8.27it/s, avg_loss=1.1259]


-> Train loss: 1.1641
-> Validation loss: 1.1259

Training with r=2, lr=5e-05


Steps 0:  44%|████▎     | 499/1142 [02:11<02:49,  3.80it/s]
Validating: 100%|██████████| 75/75 [00:07<00:00,  9.42it/s, avg_loss=0.8970]


-> Train loss: 1.0710
-> Validation loss: 0.8970

Training with r=2, lr=0.0001


Steps 0:  44%|████▎     | 499/1142 [02:11<02:49,  3.79it/s]
Validating: 100%|██████████| 75/75 [00:09<00:00,  8.12it/s, avg_loss=0.8383]


-> Train loss: 1.0383
-> Validation loss: 0.8383

Training with r=4, lr=1e-05


Steps 0:  44%|████▎     | 499/1142 [02:05<02:41,  3.97it/s]
Validating: 100%|██████████| 75/75 [00:07<00:00, 10.26it/s, avg_loss=1.0203]


-> Train loss: 1.2273
-> Validation loss: 1.0203

Training with r=4, lr=5e-05


Steps 0:  44%|████▎     | 499/1142 [02:06<02:42,  3.96it/s]
Validating: 100%|██████████| 75/75 [00:09<00:00,  8.21it/s, avg_loss=0.8529]


-> Train loss: 0.8702
-> Validation loss: 0.8529

Training with r=4, lr=0.0001


Steps 0:  44%|████▎     | 499/1142 [02:00<02:34,  4.15it/s]
Validating: 100%|██████████| 75/75 [00:08<00:00,  8.70it/s, avg_loss=0.7795]


-> Train loss: 0.8465
-> Validation loss: 0.7795

Training with r=8, lr=1e-05


Steps 0:  44%|████▎     | 499/1142 [01:59<02:33,  4.18it/s]
Validating: 100%|██████████| 75/75 [00:07<00:00, 10.52it/s, avg_loss=0.9397]


-> Train loss: 1.0821
-> Validation loss: 0.9397

Training with r=8, lr=5e-05


Steps 0:  44%|████▎     | 499/1142 [02:01<02:36,  4.12it/s]
Validating: 100%|██████████| 75/75 [00:08<00:00,  8.79it/s, avg_loss=0.8051]


-> Train loss: 0.9173
-> Validation loss: 0.8051

Training with r=8, lr=0.0001


Steps 0:  44%|████▎     | 499/1142 [02:00<02:35,  4.14it/s]
Validating: 100%|██████████| 75/75 [00:08<00:00,  8.71it/s, avg_loss=0.7376]


-> Train loss: 0.7409
-> Validation loss: 0.7376


In [12]:
import pandas as pd
HP_search_results_df = pd.DataFrame(results)
print(HP_search_results_df)
HP_search_results_df.to_csv("hp_tuning_results/hp_tun_rank_lr.csv")

   rank  learning_rate  Train Loss  Validation Loss
0     2        0.00001    1.164133         1.125851
1     2        0.00005    1.071049         0.897050
2     2        0.00010    1.038254         0.838324
3     4        0.00001    1.227332         1.020328
4     4        0.00005    0.870162         0.852907
5     4        0.00010    0.846522         0.779493
6     8        0.00001    1.082127         0.939741
7     8        0.00005    0.917339         0.805061
8     8        0.00010    0.740861         0.737563


After determining best hyper parameters for "rank" and "learning rate", we can procede to determine which of the three context lengths $[128, 512, 768]$ perform the best for a maximun of 2000 RLPPP steps

In [13]:
context_lengths = [128, 512, 768]
best_r = 8
best_lr = 1e-4

for cl in context_lengths:
    print(f"\nTraining with context_lenghts")

    # Load fresh model
    model, _ = load_qwen()
    trained_model, final_loss = train_lora_model(model, tokenizer, lora_rank=best_r, learning_rate=best_lr, max_ctx_length=cl, train_steps=500)

    val_loss, _ = evaluate_loss_perplexity_val(trained_model, tokenizer, val_texts, 4)

    results.append({"context_lengths": lr, "Train Loss": final_loss, "Validation Loss": val_loss})
    print(f"-> Train loss: {final_loss:.4f}")
    print(f"-> Validation loss: {val_loss:.4f}")

    # Clean up to free GPU memory
    del model
    del trained_model
    torch.cuda.empty_cache()
    gc.collect()



Training with context_lenghts


Steps 0:  11%|█▏        | 499/4374 [00:38<04:59, 12.94it/s]
Validating: 100%|██████████| 75/75 [00:09<00:00,  8.05it/s, avg_loss=0.7386]


-> Train loss: 1.0504
-> Validation loss: 0.7386

Training with context_lenghts


Steps 0:  44%|████▎     | 499/1142 [02:10<02:48,  3.81it/s]
Validating: 100%|██████████| 75/75 [00:09<00:00,  7.57it/s, avg_loss=0.7543]


-> Train loss: 0.7791
-> Validation loss: 0.7543

Training with context_lenghts


Steps 0:  55%|█████▌    | 499/900 [14:26<11:36,  1.74s/it] 
Validating: 100%|██████████| 75/75 [00:22<00:00,  3.35it/s, avg_loss=0.7788]


-> Train loss: 0.9044
-> Validation loss: 0.7788


In [14]:
import pandas as pd
HP_search_results_df = pd.DataFrame(results)
print(HP_search_results_df)

HP_search_results_df.to_csv("hp_tuning_results/hp_tun_cl.csv")

    rank  learning_rate  Train Loss  Validation Loss  context_lengths
0    2.0        0.00001    1.164133         1.125851              NaN
1    2.0        0.00005    1.071049         0.897050              NaN
2    2.0        0.00010    1.038254         0.838324              NaN
3    4.0        0.00001    1.227332         1.020328              NaN
4    4.0        0.00005    0.870162         0.852907              NaN
5    4.0        0.00010    0.846522         0.779493              NaN
6    8.0        0.00001    1.082127         0.939741              NaN
7    8.0        0.00005    0.917339         0.805061              NaN
8    8.0        0.00010    0.740861         0.737563              NaN
9    NaN            NaN    1.050353         0.738644           0.0001
10   NaN            NaN    0.779086         0.754333           0.0001
11   NaN            NaN    0.904416         0.778757           0.0001
