# PoLitBert - Polish RoBERT'a model 

## Model training experiments' protocols.

Training environment details:
* PyTorch 1.5
* Apex
* CUDA 10.2
* fairseq 0.9


Experiments were additionally compared in a separate [research log](https://docs.google.com/spreadsheets/d/1fBhELqDB1kAxLCBvzeVM4OhqO4zx-meRUVljK1YZfF8/edit#gid=0)

* Experiment 1 - linear decay, 50k updates
 * Linear schedule, peak_lr=5e-4, updates=50000, bsz=8192. Tests the convergence speed of the linear schedule, looking for the fastest-converging setup. https://tensorboard.dev/experiment/pNCxXW9zQKKEoxkNN1LeKg/#scalars&tagFilter=lr%7C.*loss%24%7Cppl
* Experiment 2 - cyclic triangular, 50k updates
 * Cyclic triangular schedule, updates=50000, bsz=8192, cyclic step=5000. Tests the convergence speed of the schedule, looking for the fastest-converging setup. https://tensorboard.dev/experiment/pNCxXW9zQKKEoxkNN1LeKg/#scalars&tagFilter=lr%7C.*loss%24%7Cppl
* Experiment 3 - cyclic cosine, 50k updates
 * TODO: EXPERIMENT_DESCRIPTION
* Experiment 4 - cyclic cosine, 50k updates
 * Cyclic cosine schedule, updates=50000, bsz=8192, cyclic step=2500. Tests the convergence speed of the schedule, looking for the fastest-converging setup; it should behave like the triangular schedule, with the lr going up and down within 5000 steps. https://tensorboard.dev/experiment/SY64gY46SKq7wGohxgjlgg/#scalars&tagFilter=lr%7C.*loss%24%7Cppl The experiment was stopped after 23k steps: the loss had jumped and then plateaued.
* Experiments 5, 6, 7 - cyclic cosine, 50k updates
 * TODO: EXPERIMENT_DESCRIPTION
* Experiment 8 - cyclic triangular, 125k updates
 * TODO: EXPERIMENT_DESCRIPTION
* Experiment 9 - cyclic cosine, 125k updates
 * TODO: EXPERIMENT_DESCRIPTION
* Experiment 10 - linear, 125k updates
 * TODO: EXPERIMENT_DESCRIPTION
* Experiment 11 - vocab50k, linear, 50k updates
 * TODO: EXPERIMENT_DESCRIPTION

---

### Experiment 1 - linear decay, 50k updates

Vocab: 32k tokens <br>
Train on: AWS p3.16xlarge

Effective batch size = MAX_SENTENCES\*UPDATE_FREQ\*num_gpu = 16\*64\*8 = 8192
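
The arithmetic above can be reproduced in the shell as a quick sanity check before launching a run. This is only a minimal sketch: `NUM_GPU` and `EFFECTIVE_BSZ` are illustrative variable names that do not appear in the training script, and the GPU count of 8 is taken from the formula above.

```shell
MAX_SENTENCES=16   # Sequences per GPU in one forward/backward pass
UPDATE_FREQ=64     # Gradient accumulation steps per optimizer update
NUM_GPU=8          # Hypothetical variable; value taken from the formula above

# Number of sequences that contribute to a single optimizer update
EFFECTIVE_BSZ=$((MAX_SENTENCES * UPDATE_FREQ * NUM_GPU))
echo "effective batch size: $EFFECTIVE_BSZ"
```

Changing the GPU count or `MAX_SENTENCES` while adjusting `UPDATE_FREQ` so that the product stays the same should keep runs comparable.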

First experiment with linear scheduler, run for 50k updates.

```
TOTAL_UPDATES=50000     # Total number of training steps
WARMUP_UPDATES=10000    # Warmup the learning rate over this many updates
PEAK_LR=0.0005          # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=64          # Increase the batch size 64x

DATA_DIR=./data/wiki_books_oscar/vocab32k/
SAVE_DIR=./checkpoints/wiki_books_oscar_32k_linear/
LOGS_DIR=./checkpoints/wiki_books_oscar_32k_linear/logs/

fairseq-train --fp16 $DATA_DIR \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 --skip-invalid-size-inputs-valid-test \
    --save-dir $SAVE_DIR --tensorboard-logdir $LOGS_DIR --keep-last-epochs 10 \
    --ddp-backend=no_c10d

```
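
The shape of the schedule configured above can be checked offline before spending GPU hours. The sketch below assumes linear warmup from 0 to `PEAK_LR` over `WARMUP_UPDATES`, followed by linear decay to 0 at `TOTAL_UPDATES`. The helper `linear_lr` is hypothetical and is not part of fairseq, so treat its output as an approximation of what `polynomial_decay` should produce with its default settings.

```shell
# Hypothetical helper: learning rate at a given update for
# linear warmup followed by linear decay (values from Experiment 1).
linear_lr() {
    LC_ALL=C awk -v step="$1" -v peak=0.0005 -v warmup=10000 -v total=50000 'BEGIN {
        if (step < warmup)
            lr = peak * step / warmup                      # linear warmup from 0
        else
            lr = peak * (total - step) / (total - warmup)  # linear decay to 0
        printf "%.6f\n", lr
    }'
}

linear_lr 5000     # halfway through warmup
linear_lr 10000    # end of warmup, peak lr
linear_lr 30000    # halfway through decay
```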

```
tensorboard dev upload --logdir $LOGS_DIR \
    --name "PoLitBert - Polish RoBERT'a model, exp. #1" \
    --description "- linear decay, 50k updates, vocab32k, --save-dir ${SAVE_DIR}"

```

### Experiment 2 - cyclic triangular, 50k updates

Vocab: 32k tokens <br>
Train on: AWS p3.16xlarge

Effective batch size = MAX_SENTENCES\*UPDATE_FREQ\*num_gpu = 16\*64\*8 = 8192

Cyclic triangular schedule: each 5000-step cycle rises to the peak lr and falls back to the base lr;
after each cycle the peak and base lr are shrunk.
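
The cycle described above can be sketched numerically. This is an illustration under stated assumptions, not fairseq's implementation: the lr rises linearly for half of the 5000-step period and falls linearly for the other half, and both the base and the peak lr are multiplied by the shrink factor once per completed cycle. The helper name `triangular_lr` is hypothetical.

```shell
# Hypothetical helper: learning rate of a shrinking triangular cycle
# at a given update (values from Experiment 2).
triangular_lr() {
    LC_ALL=C awk -v step="$1" -v base=0.0001 -v peak=0.001 -v period=5000 -v shrink=0.8 'BEGIN {
        half  = period / 2                 # updates spent rising, then falling
        cycle = int(step / period)         # index of the current cycle
        scale = shrink ^ cycle             # both bounds shrink once per cycle
        lo = base * scale
        hi = peak * scale
        x = step / half - 2 * cycle - 1    # runs from -1 to 1 within one cycle
        if (x < 0) x = -x
        printf "%.6f\n", lo + (hi - lo) * (1 - x)
    }'
}

triangular_lr 2500    # top of the first cycle
triangular_lr 5000    # start of the second cycle, shrunk base lr
triangular_lr 7500    # top of the second cycle, shrunk peak lr
```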

```
TOTAL_UPDATES=50000     # Total number of training steps
STEP_SIZE=5000
BASE_LR=0.0001
PEAK_LR=0.001           # Peak learning rate, adjust as needed
LR_SHRINK=0.8
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=64          # Increase the batch size 64x

DATA_DIR=./data/wiki_books_oscar/vocab32k/
SAVE_DIR=./checkpoints/wiki_books_oscar_32k_tri/
LOGS_DIR=./checkpoints/wiki_books_oscar_32k_tri/logs/

fairseq-train --fp16 $DATA_DIR \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler triangular --lr $BASE_LR --max-lr $PEAK_LR \
    --lr-period-updates $STEP_SIZE --lr-shrink $LR_SHRINK --shrink-min \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --skip-invalid-size-inputs-valid-test \
    --tensorboard-logdir $LOGS_DIR --log-format simple --log-interval 1  --save-dir $SAVE_DIR \
    --ddp-backend=no_c10d

```

```
tensorboard dev upload --logdir $LOGS_DIR \
    --name "PoLitBert - Polish RoBERT'a model, exp. #2" \
    --description "- cyclic triangular, 50k updates, vocab32k, --save-dir ${SAVE_DIR}"

```

### Experiment 3 - cyclic cosine, 50k updates

Vocab: upper 32k tokens <br>
Train on: AWS p3.16xlarge

Effective batch size = MAX_SENTENCES\*UPDATE_FREQ\*num_gpu = 16\*64\*8 = 8192

Cyclic cosine schedule: the lr rises to the peak lr and falls back to the base lr within 5000 steps
(WARMUP_UPDATES plus STEP_SIZE in the configuration below); the peak and base lr shrink after each cycle.
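
A rough numerical sketch of this schedule follows. It assumes a linear warmup from the base lr to the peak lr, then cosine cycles of fixed length that restart at a peak shrunk by the shrink factor each time; the helper name `cosine_lr` is hypothetical and the code is not taken from fairseq.

```shell
# Hypothetical helper: learning rate of a cosine schedule with warm restarts
# at a given update (values from Experiment 3, fixed cycle length).
cosine_lr() {
    LC_ALL=C awk -v step="$1" -v base=0.0001 -v peak=0.001 \
                 -v warmup=2500 -v period=2500 -v shrink=0.8 'BEGIN {
        pi = atan2(0, -1)
        if (step < warmup) {
            lr = base + (peak - base) * step / warmup   # linear warmup
        } else {
            t     = step - warmup
            cycle = int(t / period)        # index of the current cosine cycle
            pos   = t - cycle * period     # position inside that cycle
            scale = shrink ^ cycle         # bounds shrink once per cycle
            lo = base * scale
            hi = peak * scale
            lr = lo + 0.5 * (hi - lo) * (1 + cos(pi * pos / period))
        }
        printf "%.6f\n", lr
    }'
}

cosine_lr 2500    # end of warmup, peak lr
cosine_lr 3750    # middle of the first cosine cycle
cosine_lr 5000    # restart with a shrunk peak lr
```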

```
TOTAL_UPDATES=50000     # Total number of training steps
STEP_SIZE=2500
WARMUP_UPDATES=2500     # same as triangular
BASE_LR=0.0001
PEAK_LR=0.001           # Peak learning rate, adjust as needed
LR_SHRINK=0.8           # Shrink factor for peak and base lr
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=64          # Increase the batch size 64x

DATA_DIR=./data/wiki_books_oscar/vocab32k/
SAVE_DIR=./checkpoints/wiki_books_oscar_32k_cos1/
LOGS_DIR=./checkpoints/wiki_books_oscar_32k_cos1/logs/


fairseq-train --fp16 $DATA_DIR \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler cosine --lr $BASE_LR --max-lr $PEAK_LR \
    --warmup-updates $WARMUP_UPDATES  \
    --lr-period-updates $STEP_SIZE --t-mult 1 --lr-shrink $LR_SHRINK  \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --skip-invalid-size-inputs-valid-test \
    --tensorboard-logdir $LOGS_DIR --log-format simple --log-interval 1  --save-dir $SAVE_DIR \
    --ddp-backend=no_c10d

```

```
tensorboard dev upload --logdir $LOGS_DIR \
    --name "PoLitBert - Polish RoBERT'a model, exp. #3" \
    --description "- cyclic cosine, 50k updates, vocab32k, step=2500 --save-dir ${SAVE_DIR}"

```


### Experiment 4 - cyclic cosine, 50k updates

Vocab: 32k tokens <br>
Train on: AWS p3.16xlarge

Effective batch size = MAX_SENTENCES\*UPDATE_FREQ\*num_gpu = 16\*64\*8 = 8192

Cyclic cosine schedule: the lr rises to the peak lr and falls back to the base lr within 5000 steps
(WARMUP_UPDATES plus STEP_SIZE in the configuration below); the peak and base lr shrink after each cycle.

```
TOTAL_UPDATES=50000     # Total number of training steps
STEP_SIZE=2500
WARMUP_UPDATES=2500     # Same as triangular
BASE_LR=0.0001
PEAK_LR=0.0005          # Peak learning rate, adjust as needed
LR_SHRINK=0.8
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=64          # Increase the batch size 64x

DATA_DIR=./data/wiki_books_oscar/vocab32k/
SAVE_DIR=./checkpoints/wiki_books_oscar_32k_cos1_2/
LOGS_DIR=./checkpoints/wiki_books_oscar_32k_cos1_2/logs/

fairseq-train --fp16 $DATA_DIR \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler cosine --lr $BASE_LR --max-lr $PEAK_LR \
    --warmup-updates $WARMUP_UPDATES  \
    --lr-period-updates $STEP_SIZE --t-mult 1  --lr-shrink $LR_SHRINK  \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --skip-invalid-size-inputs-valid-test \
    --tensorboard-logdir $LOGS_DIR --log-format simple --log-interval 1  --save-dir $SAVE_DIR \
    --ddp-backend=no_c10d --num-workers 2

```

```
tensorboard dev upload --logdir $LOGS_DIR \
    --name "PoLitBert - Polish RoBERT'a model, exp. #4" \
    --description "- cyclic cosine, 50k updates, vocab32k, step=2500 half lr=0.0005, --save-dir ${SAVE_DIR}"

```

### Experiments 5, 6, 7 - cyclic cosine, 50k updates

Vocab: upper 32k tokens <br>
Train on: AWS p3.16xlarge

Effective batch size = MAX_SENTENCES\*UPDATE_FREQ\*num_gpu = 16\*64\*8 = 8192

Cyclic cosine schedule, 1000 steps to rise to the peak lr and 1000 steps for the first fall to the base lr.
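
The command for this experiment passes `--t-mult 2`, so the cosine cycles are not all the same length. The sketch below lists where each cycle would start, assuming the cycle length is multiplied by that factor after every cycle and that the first cycle begins once warmup ends. `T_MULT`, `CYCLE_STARTS`, `start` and `cycle_len` are illustrative names, not part of the training script.

```shell
TOTAL_UPDATES=50000
WARMUP_UPDATES=1000
STEP_SIZE=1000      # Length of the first cosine cycle
T_MULT=2            # Hypothetical variable mirroring --t-mult 2

start=$WARMUP_UPDATES
cycle_len=$STEP_SIZE
CYCLE_STARTS=""
# Walk through the cycles until the next one would begin after training ends
while [ "$start" -lt "$TOTAL_UPDATES" ]; do
    CYCLE_STARTS="$CYCLE_STARTS $start"
    start=$((start + cycle_len))
    cycle_len=$((cycle_len * T_MULT))
done
echo "cycles start at updates:$CYCLE_STARTS"
```

Under these assumptions the last cycle is longer than the remaining training budget, so the run would end partway through its decay.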

```
TOTAL_UPDATES=50000     # Total number of training steps
STEP_SIZE=1000
WARMUP_UPDATES=1000
BASE_LR=0.0001
PEAK_LR=0.001           # Peak learning rate, adjust as needed
LR_SHRINK=0.8
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=64          # Increase the batch size 64x

DATA_DIR=./data/wiki_books_oscar/vocab32k/
SAVE_DIR=./checkpoints/wiki_books_oscar_32k_cos1_4/
LOGS_DIR=./checkpoints/wiki_books_oscar_32k_cos1_4/logs/


fairseq-train --fp16 $DATA_DIR \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.9 \
    --lr-scheduler cosine --lr $BASE_LR --max-lr $PEAK_LR \
    --warmup-updates $WARMUP_UPDATES  \
    --lr-period-updates $STEP_SIZE --t-mult 2  --lr-shrink $LR_SHRINK  \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --skip-invalid-size-inputs-valid-test \
    --tensorboard-logdir $LOGS_DIR --log-format simple --log-interval 1  --save-dir $SAVE_DIR \
    --ddp-backend=no_c10d --num-workers 2

```

```
tensorboard dev upload --logdir $LOGS_DIR \
    --name "PoLitBert - Polish RoBERT'a model, exp. #6"  \
    --description "- cyclic cosine, 50k updates, step=1000, t-mult=2, clip-norm=0.9, --save-dir ${SAVE_DIR}"

```

### Experiment 8 - cyclic triangular, 125k updates

Vocab: 32k tokens <br>
Train on: AWS p3.16xlarge

Effective batch size = MAX_SENTENCES\*UPDATE_FREQ\*num_gpu = 16\*64\*8 = 8192

```
TOTAL_UPDATES=125000    # Total number of training steps
STEP_SIZE=5000
BASE_LR=0.0001
PEAK_LR=0.001           # Peak learning rate, adjust as needed
LR_SHRINK=0.8
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=64          # Increase the batch size 64x

DATA_DIR=./data/wiki_books_oscar/vocab32k/
SAVE_DIR=./checkpoints/wiki_books_oscar_32k_tri_full/
LOGS_DIR=./checkpoints/wiki_books_oscar_32k_tri_full/logs/

fairseq-train --fp16 $DATA_DIR \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler triangular --lr $BASE_LR --max-lr $PEAK_LR \
    --lr-period-updates $STEP_SIZE --lr-shrink $LR_SHRINK --shrink-min \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --skip-invalid-size-inputs-valid-test \
    --tensorboard-logdir $LOGS_DIR --log-format simple --log-interval 1  --save-dir $SAVE_DIR \
    --ddp-backend=no_c10d

```

```
tensorboard dev upload --logdir $LOGS_DIR \
    --name "PoLitBert - Polish RoBERT'a model, exp. #8" \
    --description "- cyclic triangular, 125k updates, vocab32k, --save-dir ${SAVE_DIR}"

```

### Experiment 9 - cyclic cosine, 125k updates

Vocab: upper 32k tokens <br>
Train on: AWS p3.16xlarge

```
TOTAL_UPDATES=125000    # Total number of training steps
STEP_SIZE=1000
WARMUP_UPDATES=5000 
BASE_LR=0.00001
PEAK_LR=0.0007          # Peak learning rate, adjust as needed
LR_SHRINK=0.7
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=64          # Increase the batch size 64x

DATA_DIR=./data/wiki_books_oscar/vocab32k/
SAVE_DIR=./checkpoints/wiki_books_oscar_32k_cos1_5/
LOGS_DIR=./checkpoints/wiki_books_oscar_32k_cos1_5/logs/

fairseq-train --fp16 $DATA_DIR \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.9 \
    --lr-scheduler cosine --lr $BASE_LR --max-lr $PEAK_LR \
    --warmup-updates $WARMUP_UPDATES  \
    --lr-period-updates $STEP_SIZE --t-mult 2  --lr-shrink $LR_SHRINK  \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --skip-invalid-size-inputs-valid-test \
    --tensorboard-logdir $LOGS_DIR --log-format simple --log-interval 1  --save-dir $SAVE_DIR \
    --ddp-backend=no_c10d --num-workers 2

```

```
tensorboard dev upload --logdir $LOGS_DIR \
    --name "PoLitBert - Polish RoBERT'a model, exp. #9"  \
    --description "- cyclic cosine, 125k updates, vocab32k, t-mult=2 clip-norm=0.9, --save-dir ${SAVE_DIR}"

```

### Experiment 10 - linear, 125k updates

Vocab: 32k tokens <br>
Train on: AWS p3.16xlarge

Effective batch size = MAX_SENTENCES\*UPDATE_FREQ\*num_gpu = 16\*64\*8 = 8192

```
TOTAL_UPDATES=125000    # Total number of training steps
WARMUP_UPDATES=10000    # Warmup the learning rate over this many updates
PEAK_LR=0.001           # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=64          # Increase the batch size 64x

DATA_DIR=./data/wiki_books_oscar/vocab32k/
SAVE_DIR=./checkpoints/wiki_books_oscar_32k_linear_full/
LOGS_DIR=./checkpoints/wiki_books_oscar_32k_linear_full/logs/

fairseq-train --fp16 $DATA_DIR \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.9 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 --skip-invalid-size-inputs-valid-test \
    --save-dir $SAVE_DIR --tensorboard-logdir $LOGS_DIR  \
    --ddp-backend=no_c10d

```

```
tensorboard dev upload --logdir $LOGS_DIR \
    --name "PoLitBert - Polish RoBERT'a model, exp. #10"  \
    --description "- linear, 125k updates, vocab32k, clip-norm=0.9, --save-dir ${SAVE_DIR}"

```

### Experiment 11 - vocab50k, linear, 50k updates

Vocab: 50k tokens <br>
Train on: AWS p3.16xlarge

Effective batch size = MAX_SENTENCES\*UPDATE_FREQ\*num_gpu = 16\*64\*8 = 8192

```
TOTAL_UPDATES=50000     # Total number of training steps
WARMUP_UPDATES=10000    # Warmup the learning rate over this many updates
PEAK_LR=0.001           # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512   # Max sequence length
MAX_POSITIONS=512       # Num. positional embeddings (usually same as above)
MAX_SENTENCES=16        # Number of sequences per batch (batch size)
UPDATE_FREQ=64          # Increase the batch size 64x

DATA_DIR=./data/wiki_books_oscar/vocab50k/
SAVE_DIR=./checkpoints/wiki_books_oscar_50k_linear50k/
LOGS_DIR=./checkpoints/wiki_books_oscar_50k_linear50k/logs/

fairseq-train --fp16 $DATA_DIR \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9,0.98)' --adam-eps 1e-6 --clip-norm 0.9 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES --update-freq $UPDATE_FREQ \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 --skip-invalid-size-inputs-valid-test \
    --save-dir $SAVE_DIR --tensorboard-logdir $LOGS_DIR  \
    --ddp-backend=no_c10d

```

```
tensorboard dev upload --logdir $LOGS_DIR \
    --name "PoLitBert - Polish RoBERT'a model, exp. #11"  \
    --description "- linear, 50k updates, vocab50k, clip-norm=0.9, --save-dir ${SAVE_DIR}"

```

