# Train HTR recognition models with Kraken

If you lack the infrastructure to train Kraken on large datasets, *Google Colab* is a good alternative. By default, all your data is deleted at the end of your session, so make sure to download and save any files you want to keep.

There are two ways of training HTR models with Kraken: you can train a model *from scratch* or *fine-tune* a pre-trained model. This notebook takes you through both options; follow only the sections your project needs.

## Installation
For either option, first install Kraken in *Colab*. This may take a few minutes. The installer output is redirected to `pip_output.txt` rather than printed in the notebook.

In [None]:
!pip install kraken==5.2.9 > pip_output.txt
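
Because the installer output goes to a file, a failed installation is easy to miss. The following is a minimal sketch for checking that the package is present, using only the Python standard library; the helper name `installed_version` is made up for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("kraken")
if version is None:
    print("kraken is not installed; check pip_output.txt for errors.")
else:
    print(f"kraken {version} is installed.")
```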

## Prepare training directory
Before you start training, upload the data you want to train your model on: the image files and your prepared transcriptions in an XML format. This notebook assumes you are using ALTO, but you can switch to PAGE a little further down.

Save all your image and XML files to one folder and replace `PATH` in the following line with the path to that folder. Then run that line. It writes the path of every XML file in the folder to `output.txt`.

In [None]:
!find PATH/ -type f -name "*.xml" > output.txt
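
If you prefer to build the file list in Python, for instance to see how many transcriptions were found, the sketch below does roughly what the `find` command does. The function names `list_xml_files` and `write_manifest` are invented for this example, and `PATH/` is the same placeholder as above.

```python
from pathlib import Path


def list_xml_files(root):
    """Recursively collect all regular files ending in .xml below `root`, sorted."""
    return sorted(p for p in Path(root).rglob("*.xml") if p.is_file())


def write_manifest(root, manifest="output.txt"):
    """Write one XML path per line to `manifest` and return the number of files."""
    xml_files = list_xml_files(root)
    Path(manifest).write_text("".join(f"{p}\n" for p in xml_files), encoding="utf-8")
    return len(xml_files)


# Example (replace PATH/ with your own folder before uncommenting):
# count = write_manifest("PATH/")
# print(f"{count} XML files listed in output.txt")
```

A count of zero usually means the path is wrong or the upload has not finished.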

## Training a model from scratch

Use this section to train a model **from scratch**. If you wish to fine-tune a model, skip to the next section.

First, you may want to split your material at random. Run the following line to assign 80% to the training set, 10% to the validation set, and 10% to the test set. The compiled dataset is saved as `dataset.arrow`, which the training command reads.

If your ground truth comes in PAGE XML, change `alto` to `page`.

In [None]:
!ketos compile -F output.txt --random-split 0.8 0.1 0.1 -f alto
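
To get a feeling for what these proportions mean for a small corpus, the arithmetic can be sketched as follows. This is only an illustration of the shares; it is not Kraken's internal splitting logic, and the function `split_counts` is hypothetical.

```python
def split_counts(n_items, shares):
    """Approximate how many of `n_items` fall into each named partition."""
    counts = {name: round(n_items * share) for name, share in shares.items()}
    # Give any rounding drift to the largest partition so the total stays n_items.
    largest = max(shares, key=shares.get)
    counts[largest] += n_items - sum(counts.values())
    return counts


shares = {"train": 0.8, "validation": 0.1, "test": 0.1}
print(split_counts(200, shares))
print(split_counts(25, shares))
```

With very few items, the validation and test partitions shrink to a handful of samples, which makes the reported accuracy less reliable.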

Now adjust the following line by setting the folder where you would like your models to be saved and the prefix (i.e., the name) for your model. This should look somewhat like `folder_to_store/model_prefix`.

Here, `--min-epochs 30` makes training run for at least 30 epochs before it can stop. If you do not want to set a minimum, delete `--min-epochs 30`.

If you have set up a GPU, include `-d cuda:0` after `binary`. If you do not specify a device, Kraken runs on the CPU.

Now run this line.

In [None]:
!ketos train -f binary -o PATH/PREFIX --min-epochs 30 dataset.arrow
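
The optional parts of this command (`--min-epochs` and `-d`) are easy to misplace when editing by hand. As a sketch, the command can be assembled in Python from the flags described above; `build_train_command` is a hypothetical helper, and `PATH/PREFIX` is still a placeholder.

```python
import shlex


def build_train_command(output_prefix, dataset="dataset.arrow",
                        min_epochs=None, device=None):
    """Assemble the `ketos train` argument list using only the flags shown above."""
    args = ["ketos", "train", "-f", "binary"]
    if device is not None:
        # e.g. "cuda:0"; placed after `binary`, as described in the text
        args += ["-d", device]
    args += ["-o", output_prefix]
    if min_epochs is not None:
        args += ["--min-epochs", str(min_epochs)]
    args.append(dataset)
    return args


command = build_train_command("PATH/PREFIX", min_epochs=30)
print(shlex.join(command))

# To launch it from Python instead of a `!` cell:
# import subprocess
# subprocess.run(command, check=True)
```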

Training will take quite some time, during which Kraken keeps saving model files to the assigned folder.

When training ends, choose the best-suited model and download it. Also download any other files you want to keep, as *Colab* deletes all of your data by default.
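
One way to save everything at once is to zip the model folder and download the archive. The sketch below assumes it runs inside *Colab*, where `google.colab.files.download` is available; outside *Colab* it only creates the archive. The helper names and the archive name `kraken_models` are made up for this example.

```python
import shutil


def archive_folder(folder, archive_name="kraken_models"):
    """Zip the contents of `folder` into <archive_name>.zip and return its path."""
    return shutil.make_archive(archive_name, "zip", root_dir=folder)


def download_in_colab(path):
    """Trigger a browser download in Colab; do nothing elsewhere."""
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError:
        print(f"Not running in Colab; {path} stays on disk.")
        return False
    files.download(path)
    return True


# Example (replace PATH with your model folder before uncommenting):
# zip_path = archive_folder("PATH")
# download_in_colab(zip_path)
```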

## Fine-tune a pre-trained model

If you want to fine-tune a base model with your own data, upload that model to *Colab*. You will need to specify the path to it later. First, run this line to compile your data into `dataset.arrow`.

In [None]:
!ketos compile -F output.txt -f alto

Now adjust the path to your base model (`PATH/MODEL`) and set the prefix for your fine-tuned model, i.e., the name of your new model without `.mlmodel`. Then run this line.

In [None]:
!ketos train -f binary --resize both -i PATH/MODEL -o PATH/PREFIX_NEW_MODEL dataset.arrow
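
A mistyped model path only shows up once the command starts, so a quick check beforehand can save a restart. This is a minimal sketch with a hypothetical helper, `check_finetune_paths`; it only tests that the base model file exists and that the output prefix does not carry the `.mlmodel` extension.

```python
from pathlib import Path


def check_finetune_paths(base_model, output_prefix):
    """Return a list of problems; an empty list means both paths look usable."""
    problems = []
    if not Path(base_model).is_file():
        problems.append(f"base model not found: {base_model}")
    if Path(output_prefix).suffix == ".mlmodel":
        problems.append("output prefix should not end in .mlmodel")
    return problems


# Example (replace the placeholders before uncommenting):
# for problem in check_finetune_paths("PATH/MODEL", "PATH/PREFIX_NEW_MODEL"):
#     print(problem)
```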

Training will take quite some time, during which Kraken keeps saving model files to the assigned folder.

When training ends, choose the best-suited model and download it. Also download any other files you want to keep, as *Colab* deletes all of your data by default.