### **Technical Report on Model Training Implementation**  

#### **Introduction**  
I replaced the pretrained model used in that lab with the one we use for training on the C4 dataset. I also adapted the dataset sampling process by using `DistributedSampler`, which ensures that each worker in a distributed environment is assigned different samples.  
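
The sharding idea can be illustrated with a minimal, pure-Python sketch. It is a simplified stand-in for the kind of rank-strided split that `DistributedSampler` performs, not PyTorch's implementation; the function name `shard_indices` and its parameters are hypothetical, and the sketch assumes the dataset has at least as many samples as there are workers.

```python
import random


def shard_indices(dataset_len, num_replicas, rank, seed=0, epoch=0):
    """Return the sample indices assigned to one worker (hypothetical helper)."""
    indices = list(range(dataset_len))
    # Every worker shuffles with the same seed, so all ranks agree on the order.
    random.Random(seed + epoch).shuffle(indices)
    # Pad with repeated samples so the list splits evenly across workers
    # (assumes dataset_len >= num_replicas).
    remainder = len(indices) % num_replicas
    if remainder:
        indices += indices[: num_replicas - remainder]
    # Worker `rank` takes every num_replicas-th index, starting at its own rank.
    return indices[rank::num_replicas]


# Two workers over a toy dataset of 10 samples.
shards = [shard_indices(10, num_replicas=2, rank=r) for r in range(2)]
for r, shard in enumerate(shards):
    print(f"rank {r}: {shard}")
```

Because both workers derive the same permutation and then take interleaved slices, their shards do not overlap when the dataset length is divisible by the number of workers.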

A critical modification was replacing epoch-based training with batch-based training, meaning that the training loop iterates only over batches instead of complete epochs.
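
A step-driven loop of this kind might look like the sketch below. The helper names `batches_forever` and `train`, the `train_step` callback, and the toy loader are all hypothetical; the only library detail assumed is that a sampler such as `DistributedSampler` exposes `set_epoch`, which the sketch calls when it is available. It also assumes the loader is non-empty.

```python
def batches_forever(loader, sampler=None):
    """Yield batches indefinitely, restarting the loader when it is exhausted."""
    epoch = 0
    while True:
        # Reshuffle on every pass if the sampler supports it.
        if sampler is not None and hasattr(sampler, "set_epoch"):
            sampler.set_epoch(epoch)
        for batch in loader:
            yield batch
        epoch += 1


def train(loader, train_step, total_steps, sampler=None):
    """Run exactly total_steps updates, regardless of epoch boundaries."""
    stream = batches_forever(loader, sampler)
    losses = []
    for step in range(total_steps):
        batch = next(stream)
        losses.append(train_step(step, batch))
    return losses


# Toy usage: three "batches" and a stand-in training step that sums the batch.
toy_loader = [[1, 2], [3, 4], [5, 6]]
toy_losses = train(toy_loader, lambda step, batch: sum(batch), total_steps=5)
print(toy_losses)
```

The loop is controlled only by `total_steps`; passes over the data restart transparently inside the generator.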

I also wanted to familiarize myself with MLflow, an experiment-tracking tool whose logs can be browsed in a locally hosted web UI. To achieve this, I integrated MLflow logging into my implementation.  

To launch the MLflow UI on localhost, navigate to the directory containing the `mlruns/` folder and execute:  
```bash
mlflow ui
```
MLflow makes it possible to keep all of the plots locally, which is why I added it to my solution.

---

#### **Plot of learning rate**

![Learning rate and loss function for training on 4400 steps.](imgs/plots_lr.png)

---

#### **Model Parameter Count and Training Steps Calculation**  
As required by the task, I computed the total number of model parameters, which is **2.89 × 10⁷**. This value is also logged in **MLflow** as part of the training metadata.  

Using the scaling rule, where D is the number of training tokens and N is the number of model parameters:
$$
D = 20N
$$
and the number of tokens per training step:
$$
\text{batch size} \times \text{sequence length}
$$
the optimal total number of training steps is:

$$
\text{number of training steps} = \frac{20 \times 2.89 \times 10^7}{256 \times 256} \approx 8800
$$
 
In my implementation, each worker processes batches of size 256. In a dual-GPU setup, the two GPUs process different batches in parallel, so every step consumes twice as many tokens and the total number of training steps must be divided by 2, giving a final training step count of:

$$
\frac{8800}{2} = 4400.
$$
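
The arithmetic above can be reproduced with a short script. The variable names are illustrative; the inputs are the values stated in this section. The unrounded results are about 8820 and 4410, which the report rounds to 8800 and 4400.

```python
# Values taken from this section of the report.
n_params = 2.89e7          # N: model parameters
tokens_target = 20 * n_params  # D = 20N: target number of training tokens

batch_size = 256           # samples per worker per step
seq_len = 256              # tokens per sample
tokens_per_step = batch_size * seq_len  # tokens consumed by one worker per step

# Steps needed if a single worker processed every token.
steps_single = tokens_target / tokens_per_step

# With two GPUs working in parallel, each step covers twice as many tokens.
num_gpus = 2
steps_per_gpu = steps_single / num_gpus

print(f"tokens per step (one worker): {tokens_per_step}")
print(f"steps on one GPU:  {steps_single:.1f}")
print(f"steps on two GPUs: {steps_per_gpu:.1f}")
```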

---

#### **Memory allocated and reserved across GPUs**

![Comparison of memory allocations](imgs/memory_allocated.png)

![Comparison of memory reserving](imgs/memory_reserved.png)


---

#### **Plots of validation loss for different initial learning rates**

Below are my plots of validation loss for different learning rates:

![Comparison of training with different learning rate](imgs/different_rl.png)

As we can see, a large learning rate has a strong negative impact on the initial steps of training. Here are the same plots without the first 300 steps:

![Comparison of training with different learning rate without first 300 steps](imgs/different_rl_zoom.png)

The final validation loss for each learning rate is:

- lr = 1e-2, valid_loss = 5.9875,
- lr = 1e-3, valid_loss = 4.7323,
- lr = 1e-4, valid_loss = 6.1479.

---

#### **Plot of learning rate stitched from 2 experiments**

For this experiment, I changed the initialization of the scheduler to see what happens at the beginning of training when the optimal number of training steps is 4400. I also reduced the model size, as I did for all of the 'small' experiments such as this one. In this case, the number of validation steps remains the same, so the initial value of the loss function differs considerably from the experiments with the full series of steps, but we can observe that the curve continues across the save and load images.

In the first picture, we can see that most of the steps belong to the warmup, which lasts 0.01 * 4400 = 44 steps. In the second picture, we can see the shape of a sine function, which is expected.

![Learning rate from initialization to saving at 50 steps.](imgs/save.png)

![Learning rate from loading to 100 steps.](imgs/load.png)

![Validation loss function from initialization to saving at 50 steps.](imgs/valid_save.png)

![Validation loss function from loading to 100 steps.](imgs/valid_load.png)
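
The save-and-resume behaviour seen in these plots can be sketched with a small stand-alone schedule. This is an illustration under stated assumptions, not the scheduler used in the experiments: it assumes linear warmup over 1% of 4400 steps followed by cosine decay, and the class `WarmupCosine`, its methods, and the peak learning rate of 1e-3 are all hypothetical.

```python
import math


class WarmupCosine:
    """Hypothetical schedule: linear warmup, then cosine decay (an assumption)."""

    def __init__(self, peak_lr, total_steps, warmup_frac=0.01):
        self.peak_lr = peak_lr
        self.total_steps = total_steps
        self.warmup_steps = max(1, round(warmup_frac * total_steps))
        self.step_count = 0

    def lr_at(self, step):
        if step < self.warmup_steps:
            # Linear ramp that reaches peak_lr on the last warmup step.
            return self.peak_lr * ((step + 1) / self.warmup_steps)
        decay_steps = max(1, self.total_steps - self.warmup_steps)
        progress = min(1.0, (step - self.warmup_steps) / decay_steps)
        return 0.5 * self.peak_lr * (1.0 + math.cos(math.pi * progress))

    def step(self):
        lr = self.lr_at(self.step_count)
        self.step_count += 1
        return lr

    def state_dict(self):
        return {"step_count": self.step_count}

    def load_state_dict(self, state):
        self.step_count = state["step_count"]


# Reference run: 100 uninterrupted steps.
full = WarmupCosine(peak_lr=1e-3, total_steps=4400)
uninterrupted = [full.step() for _ in range(100)]

# Run 50 steps, save the state, then resume in a fresh object up to step 100.
first = WarmupCosine(peak_lr=1e-3, total_steps=4400)
first_part = [first.step() for _ in range(50)]
saved = first.state_dict()

second = WarmupCosine(peak_lr=1e-3, total_steps=4400)
second.load_state_dict(saved)
second_part = [second.step() for _ in range(50)]

print(first.warmup_steps)
print(first_part + second_part == uninterrupted)
```

Restoring the step counter is what makes the resumed curve line up with the saved one; if the counter were reset to zero on load, the warmup would start over.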

