In [2]:
from src import model_generation_flops, model_training_flops, model_evaluation_flops

# **Planning the flops budget**
## Pre Experimentation plan

- The previous notebook's calculations showed that we have a very limited FLOPs budget: approximately 15,000 training steps at 512 tokens and a batch size of 4

- The two major sources of flops are:
    - Training the model
    - Autoregressive Generation (this is no longer included)

- These calculations all assume a LoRA rank of 4 (LoRA makes very little difference to the overall scale)

## Post Experimentation

- For future evaluation, training and test runs, the functions save all metrics to .json files, including a breakdown of the FLOPs used
- The functions calculate these dynamically, i.e. they account for the exact number of batches evaluated on, the exact number of steps taken given early stopping, etc.
- These files can then be read automatically by `flops_in_folder`
- A summary is provided at the top of each notebook and a final value at the end
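
A minimal sketch of how such per-run .json files could be summed. The actual `flops_in_folder` implementation is not shown in this notebook, so the function name `sum_flops_in_folder`, the `total_flops` key and the flat folder layout are all assumptions made for illustration.

```python
import json
from pathlib import Path


def sum_flops_in_folder(folder, key="total_flops"):
    """Sum the `key` field over every .json file directly inside `folder`.

    Hypothetical helper: the key name and file layout are assumptions,
    not the notebook's actual `flops_in_folder`.
    """
    total = 0.0
    for path in sorted(Path(folder).glob("*.json")):
        with path.open() as f:
            metrics = json.load(f)
        # Files without a FLOPs entry contribute nothing instead of failing.
        total += float(metrics.get(key, 0.0))
    return total
```

Keeping one file per run means the total reflects the steps and batches actually executed, rather than the worst-case plan.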

# PLAN:

## Evaluating the Model 
- Using the Test Set Cross Entropy Loss

### Approx 0.2414% 
- I still assume the evaluation loss counts towards the budget; it is computed on 150 trajectories
- Approx 450 batches of 512 tokens (the estimate below uses `batch_size = 400`)

In [3]:
evaluation_flops, evaluation_lora_flops = model_evaluation_flops(no_tokens = 512, lora_ranks = 0 , batch_size = 400)

Total FLOPs for evaluation: 2.4137e+14
Total FLOPs from LoRA adaptation: 0
Percentage of Total FLOPs Budget:   0.24137 %
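
The percentage line can be reproduced by hand. This is a small sketch, assuming the total budget is the `1e17` FLOPs that the later cells divide by; `budget_percentage` is a hypothetical helper, not a function from `src`.

```python
FLOPS_BUDGET = 1e17  # assumed total budget, taken from the `/ 1e17` in the later cells


def budget_percentage(flops, budget=FLOPS_BUDGET):
    """Return `flops` as a percentage of the total budget."""
    return flops / budget * 100


print(f"{budget_percentage(2.4137e14):.5f} %")  # 0.24137 %
```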


## Evaluating the model with MSE

### Not accounted for (if it were, it would be very large: about 120% of the budget)

- This shows how errors accumulate over time and gives a direct, intuitive measure of the model's performance on predicting actual values rather than token probabilities.
- We will do this twice:
    - The baseline untrained Qwen model
    - The fully trained, optimised model

- We do this with the full set of examples from the full training set (150 time series trajectories)

### Change of FLOPs limit - autoregressive generation no longer counts towards the FLOPs limit
- I originally planned to use this on only 25 examples, before and after training, as this would have been 23% of my FLOPs budget
- However, the limit was later changed so that generation is not included in the FLOPs budget, so this was ramped up to the full 150 trajectories of the training set.

In [4]:
total_analysis_flop, _ = model_generation_flops(tokens_given = 970, tokens_generated = 290, lora_ranks = 0, randomness = False)

# Run twice, before and after the model is trained, on 150 examples
print(f"Total FLOPS for 150 examples before and after: {(float(total_analysis_flop) * 2 * 150):.3e}")
print(f"Percentage of Budget used: {(float(total_analysis_flop) * 2 * 150 / 1e17 * 100):.3f}%")

Total FLOPs for generating 290 tokens: 3.9863e+14
Total FLOPs from LoRA adaptation: 0
Percentage of Total FLOPs Budget:   0.39863 %
Total FLOPS for 150 examples before and after: 1.196e+17
Percentage of Budget used: 119.590%


## Training/ Tuning Plan
### (Approx 6.4 %)

## Training with the default hyperparameters
- Building this cost into the hyperparameter tuning (i.e. reusing one of its runs) was considered first
- However, it was decided that hyperparameter tuning should use a token length of 256, which further reduces the overall cost
- This excludes the default values (token length of 512), so the default run must be done separately

In [5]:
# Training if no early stopping
print('For a single test the potenital flops of training steps')
total_flops, _ = model_training_flops(no_tokens = 512, lora_ranks = 4, batch_size = 4, num_steps_training = 800)
# Evaluation
print('')
print('For a single mid training evaluation (repeated up to 20 times)')
valuation_flops_inter, _ = model_evaluation_flops(no_tokens = 512, lora_ranks = 4 , batch_size = 25)
print('')
print('For a final evaluation')
valuation_flops_final, _ = model_evaluation_flops(no_tokens = 512, lora_ranks = 4 , batch_size = 450)
print('')
print('For a Total evaluation:')
print(f"{(20 * valuation_flops_inter + valuation_flops_final):.3e}")
print('')
print(f"Total FLOPs of single test:")
print(f"{(20 * valuation_flops_inter + valuation_flops_final + total_flops):.3e}")
print('')
print('Across all potenital tests:')
# Single run with the default hyperparameters, so no multiplier is applied (this line prints the evaluation FLOPs only)
print(f"{((20 * valuation_flops_inter + valuation_flops_final)):.3e}")
print(f"Percentage of Budget used: {((20 * valuation_flops_inter + valuation_flops_final + total_flops) / 1e17 * 100):.3f}%")

For a single test the potenital flops of training steps
Total FLOPs for training: 5.7964e+15
Total FLOPs from LoRA adaptation: 3.5927e+12
Percentage of Total FLOPs Budget:   5.7964 %

For a single mid training evaluation (repeated up to 20 times)
Total FLOPs for evaluation: 1.5095e+13
Total FLOPs from LoRA adaptation: 9.3561e+09
Percentage of Total FLOPs Budget:   0.015095 %

For a final evaluation
Total FLOPs for evaluation: 2.7171e+14
Total FLOPs from LoRA adaptation: 1.6841e+11
Percentage of Total FLOPs Budget:   0.27171 %

For a Total evaluation:
5.736e+14

Total FLOPs of single test:
6.370e+15

Across all potenital tests:
5.736e+14
Percentage of Budget used: 6.370%
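
The single-run total above is just training plus all evaluations. A short sketch that recomputes it from the printed figures; `single_run_flops` is a hypothetical helper introduced here for illustration.

```python
def single_run_flops(training, mid_eval, final_eval, n_mid_evals):
    """Worst-case FLOPs of one run: training plus every evaluation."""
    return training + n_mid_evals * mid_eval + final_eval


# Figures printed above for 512 tokens, batch size 4, 800 steps.
run = single_run_flops(training=5.7964e15, mid_eval=1.5095e13,
                       final_eval=2.7171e14, n_mid_evals=20)
print(f"{run:.3e}")               # 6.370e+15
print(f"{run / 1e17 * 100:.3f}%")  # 6.370%
```

Training accounts for roughly 91% of this total, so the evaluation schedule has only a minor effect on the budget.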


## Hyperparameter Tuning
### (Approx 44% -  Hopefully Less)
- For baseline testing I will use a token length of 256, as self-attention makes compute grow significantly with token length; the token length can be tuned at the end
- I will attempt to reduce the hyperparameter search space as efficiently as possible using sub-experiments
- However, this budget accounts for the worst-case scenario in which all hyperparameter combinations have to be run, i.e. a grid search of **11 configurations**
- If parameters can be ruled out, it will free budget for further tuning in different ways

### **Up to 11** (hopefully fewer) configurations
- Training for up to 800 steps
- Using a batch size of 4 and 256 tokens
- Evaluate every 25 steps on a sub-batch of 25
- Evaluate at the end on the full validation set (150 trajectories)
- (Note: in the code I multiply by 14 because 2 tests use token lengths of 512 and 768, adding the equivalent of 3 runs of additional compute)
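
The factor of 14 can be sketched as follows. This assumes 9 of the 11 configurations run at 256 tokens, one at 512 and one at 768, and that cost scales roughly linearly with token length (self-attention makes the true scaling slightly steeper). The per-run figures are the 256-token values reported by the estimate cell for this section.

```python
BASE_TOKENS = 256

# Assumed split of the 11 configurations by token length.
token_lengths = [256] * 9 + [512, 768]

# Each run costs (tokens / 256) baseline runs under the linear assumption.
equivalent_runs = sum(t / BASE_TOKENS for t in token_lengths)
print(equivalent_runs)  # 14.0

# One 256-token run: training + 20 intermediate evaluations + final evaluation.
per_run = 2.8416e15 + 20 * 7.3999e12 + 1.332e14
worst_case = equivalent_runs * per_run
print(f"{worst_case / 1e17 * 100:.3f}%")  # 43.719%
```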

In [6]:
# Training if no early stopping
print('For a single test the potenital flops of training steps')
total_flops, _ = model_training_flops(no_tokens = 256, lora_ranks = 4, batch_size = 4, num_steps_training = 800)
# Evaluation
print('')
print('For a single mid training evaluation (repeated up to 20 times)')
valuation_flops_inter, _ = model_evaluation_flops(no_tokens = 256, lora_ranks = 4 , batch_size = 25)
print('')
print('For a final evaluation')
valuation_flops_final, _ = model_evaluation_flops(no_tokens = 256, lora_ranks = 4 , batch_size = 450)
print('')
print('For a Total evaluation:')
print(f"{(20 * valuation_flops_inter + valuation_flops_final):.3e}")
print('')
print(f"Total FLOPs of single test:")
print(f"{(20 * valuation_flops_inter + valuation_flops_final + total_flops):.3e}")
print('')
print('Across all potenital tests:')
# Multiply by 14 to account for the runs with 512 and 768 tokens (the first line prints evaluation FLOPs only)
print(f"{(14*(20 * valuation_flops_inter + valuation_flops_final)):.3e}")
print(f"Percentage of Budget used: {(14* (20 * valuation_flops_inter + valuation_flops_final + total_flops) / 1e17 * 100):.3f}%")

For a single test the potenital flops of training steps
Total FLOPs for training: 2.8416e+15
Total FLOPs from LoRA adaptation: 1.7964e+12
Percentage of Total FLOPs Budget:   2.8416 %

For a single mid training evaluation (repeated up to 20 times)
Total FLOPs for evaluation: 7.3999e+12
Total FLOPs from LoRA adaptation: 4.678e+09
Percentage of Total FLOPs Budget:   0.0073999 %

For a final evaluation
Total FLOPs for evaluation: 1.332e+14
Total FLOPs from LoRA adaptation: 8.4205e+10
Percentage of Total FLOPs Budget:   0.1332 %

For a Total evaluation:
2.812e+14

Total FLOPs of single test:
3.123e+15

Across all potenital tests:
3.937e+15
Percentage of Budget used: 43.719%


## Final Training
### Approx 48% (with early stopping hopefully less)
- Training for up to 4000 steps (early stopping will likely end it before this)
- Using a batch size of 4 and 768 tokens (the token length used in the estimate below)
- Evaluate every 50 steps on a sub-batch of 25
- Evaluate at the end on the full validation set

In [7]:
# Training if no early stopping
print('For all potential training steps (4000)')
total_flops, _ = model_training_flops(no_tokens = 768, lora_ranks = 4, batch_size = 4, num_steps_training = 4000)
# Evaluation
print('')
print('For a single mid training evaluation (repeated up to 160 times)')
valuation_flops_inter, _ = model_evaluation_flops(no_tokens = 768, lora_ranks = 4 , batch_size = 25)
print('')
print('For a final evaluation')
valuation_flops_final, _ = model_evaluation_flops(no_tokens = 768, lora_ranks = 4 , batch_size = 200)
print('')
print('For a Total evaluation:')
print(f"{(160 * valuation_flops_inter + valuation_flops_final):.3e}")
print(f"Total FLOPs:")
print(f"{(160 * valuation_flops_inter + valuation_flops_final + total_flops):.3e}")
print(f"Percentage of Budget used: {((160 * valuation_flops_inter + valuation_flops_final + total_flops) / 1e17 * 100):.3f}%")

For all potential training steps (4000)
Total FLOPs for training: 4.4323e+16
Total FLOPs from LoRA adaptation: 2.6946e+13
Percentage of Total FLOPs Budget:   44.323 %

For a single mid training evaluation (repeated up to 160 times)
Total FLOPs for evaluation: 2.3085e+13
Total FLOPs from LoRA adaptation: 1.4034e+10
Percentage of Total FLOPs Budget:   0.023085 %

For a final evaluation
Total FLOPs for evaluation: 1.8468e+14
Total FLOPs from LoRA adaptation: 1.1227e+11
Percentage of Total FLOPs Budget:   0.18468 %

For a Total evaluation:
3.878e+15
Total FLOPs:
4.820e+16
Percentage of Budget used: 48.201%
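
To see how much early stopping could save, the estimate can be rescaled by the number of steps actually taken. This is a rough sketch only: `final_training_flops` is a hypothetical helper, the per-step cost is the 4000-step training figure divided evenly, and `eval_every=25` is chosen so that 4000 steps give the 160 intermediate evaluations used in the estimate.

```python
def final_training_flops(steps_taken, flops_per_step, mid_eval, final_eval, eval_every):
    """FLOPs of a run that stops after `steps_taken` steps."""
    n_mid_evals = steps_taken // eval_every
    return steps_taken * flops_per_step + n_mid_evals * mid_eval + final_eval


FLOPS_PER_STEP = 4.4323e16 / 4000  # training figure above, spread over 4000 steps

for steps in (1000, 2000, 4000):
    total = final_training_flops(steps, FLOPS_PER_STEP, mid_eval=2.3085e13,
                                 final_eval=1.8468e14, eval_every=25)
    print(f"{steps} steps: {total / 1e17 * 100:.2f}% of budget")
```

Because training dominates, the cost falls almost proportionally with the stopping step.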


## Final Evaluation
- Same calculations as for the initial model

### Cross Entropy Loss (0.2414%)
### Autoregressive Generation - very high, not included
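
As a sanity check, the worst-case percentages quoted in each section of this plan can be added up. A minimal sketch; the dictionary labels are descriptive names chosen here, and the values are the approximate figures from the section headings and outputs.

```python
planned_percentages = {
    "initial cross-entropy evaluation": 0.2414,
    "default-hyperparameter training": 6.370,
    "hyperparameter tuning (worst case)": 43.719,
    "final training (no early stopping)": 48.201,
    "final cross-entropy evaluation": 0.2414,
}

total = sum(planned_percentages.values())
print(f"Planned worst-case total: {total:.2f}% of the budget")  # 98.77%
```

The plan therefore fits within the budget even if no run stops early, with autoregressive generation excluded.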