# Train a LaTeX OCR model
In this brief notebook I show how you can fine-tune or train a LaTeX OCR model.

I've opted to mix handwritten data into the regular PDF LaTeX images. To do that, I started from the released pretrained model and continued training on the slightly larger corpus.

In [None]:
!pip install "pix2tex[train]" -qq

In [None]:
import os
!mkdir -p LaTeX-OCR
os.chdir('LaTeX-OCR')

In [None]:
!pip install gpustat -q
!pip install opencv-python-headless==4.8.1.78 -U -q
!pip install --upgrade --no-cache-dir gdown -q

In [None]:
# check what GPU we have
!gpustat

In [None]:
import os
import shutil
import random
from glob import glob

# Create the necessary directories if they don't exist
!mkdir -p dataset/data
!mkdir -p dataset/data/images  # Ensure this directory is correctly referenced

# Download the datasets using gdown (the file ID is passed positionally;
# recent gdown releases no longer need the deprecated --id flag)
!gdown -O dataset/data/HANDWRTN.zip 18-bFQV5m1Ir1pq8deQiAkw743KpBSBN3         # NEW DATA
!gdown -O dataset/data/pdf.zip 176PKaCUDWmTJdQwc-OfkO0y8t4gLsIvQ              # PDF IMAGES DATA
!gdown -O dataset/data/pdfmath.txt 1QUjX6PFWPa-HBWdcY-7bA5TRVUnbyS1D          # PDF MATH DATA

# Unzip the downloaded datasets
!unzip -q dataset/data/HANDWRTN.zip -d dataset/data
!unzip -q dataset/data/pdf.zip -d dataset/data

# Define the number of validation images you want
number_of_val_images = 1000

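# Paths to the training images and to the validation split created below
images_path = 'dataset/data/images'
val_images_path = 'dataset/data/valimages'

# Label file for the handwritten data (defined for reference; not used in this cell).
# The dataset commands further down point at dataset/data/HANDWRTN_math.txt instead,
# so check where the archive actually extracts the file and keep the paths consistent.
handwrtn_txt = '/content/LaTeX-OCR/dataset/data/HANDWRTN/HANDWRTN_math.txt'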

# Check if the 'images' directory exists and has files
if os.path.exists(images_path) and os.path.isdir(images_path):
    # Get all the image files in the 'images' directory
    all_images = glob(os.path.join(images_path, '*'))

    # Shuffle the list of images
    random.shuffle(all_images)

    # Ensure the validation directory exists
    os.makedirs(val_images_path, exist_ok=True)

    # Move a subset of images to the 'valimages' directory
    for img in all_images[:number_of_val_images]:
        shutil.move(img, val_images_path)

    # The remaining files in 'images' are your training set
else:
    print(f"The directory {images_path} does not exist or is not a directory.")
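
The split above picks different validation images on every run. If a repeatable split is wanted, a seeded variant could look like the following minimal sketch. The function name `split_validation` and its parameters are hypothetical and not part of pix2tex; the default seed of 42 is only an assumption.

```python
import os
import random
import shutil


def split_validation(images_dir, val_dir, n_val, seed=42):
    """Move n_val randomly chosen files from images_dir to val_dir.

    Sorting the file names before shuffling with a fixed seed makes the
    selection repeatable across runs on the same set of files.
    """
    files = sorted(
        name for name in os.listdir(images_dir)
        if os.path.isfile(os.path.join(images_dir, name))
    )
    rng = random.Random(seed)  # local generator, leaves global state alone
    rng.shuffle(files)
    chosen = files[:n_val]

    os.makedirs(val_dir, exist_ok=True)
    for name in chosen:
        shutil.move(os.path.join(images_dir, name), os.path.join(val_dir, name))
    return chosen
```

Calling `split_validation('dataset/data/images', 'dataset/data/valimages', 1000)` would stand in for the loop above; whatever stays in `images_dir` is the training set.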


Now we generate the datasets. We can string multiple datasets together to get one large lookup table. The only things saved in these pkl files are the image sizes, the image locations and the ground truth LaTeX code. That way we can serve batches of images that all have the same dimensions.
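
To make the idea of size-grouped batches concrete, here is a small illustrative sketch. It is not the pix2tex implementation; `bucket_by_size`, `iter_batches` and the sample records are hypothetical and only show how a lookup table keyed by image size lets every batch contain images of one size.

```python
from collections import defaultdict


def bucket_by_size(records):
    """Group (equation, image_path, (width, height)) records by image size."""
    buckets = defaultdict(list)
    for equation, path, size in records:
        buckets[size].append((equation, path))
    return buckets


def iter_batches(buckets, batchsize):
    """Yield (size, batch) pairs; each batch holds images of a single size."""
    for size, items in buckets.items():
        for start in range(0, len(items), batchsize):
            yield size, items[start:start + batchsize]


# Hypothetical records: LaTeX ground truth, image location, image size.
records = [
    (r"x^2", "images/0000000.png", (64, 32)),
    (r"\frac{a}{b}", "images/0000001.png", (96, 32)),
    (r"y_i", "images/0000002.png", (64, 32)),
    (r"\alpha", "images/0000003.png", (64, 32)),
]

batches = list(iter_batches(bucket_by_size(records), batchsize=2))
for size, batch in batches:
    print(size, [path for _, path in batch])
```

Because only paths and sizes are stored, the images themselves would be loaded when a batch is requested, which keeps the lookup table small.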

In [None]:
!python -m pix2tex.dataset.dataset -i dataset/data/images dataset/data/train -e /content/LaTeX-OCR/dataset/data/HANDWRTN_math.txt dataset/data/pdfmath.txt -o dataset/data/train.pkl


We changed the path the equations are read from; in our case it is **/content/LaTeX-OCR/dataset/data/HANDWRTN_math.txt** inside a Google Colab runtime.

In [None]:
!python -m pix2tex.dataset.dataset -i dataset/data/valimages dataset/data/val -e /content/LaTeX-OCR/dataset/data/HANDWRTN_math.txt dataset/data/pdfmath.txt -o dataset/data/val.pkl

In [None]:
# download the weights we want to fine tune
!curl -L -o weights.pth https://github.com/lukas-blecher/LaTeX-OCR/releases/download/v0.0.1/weights.pth

In [None]:
# If using wandb
!pip install -q wandb
# You can skip this if you don't want to use it or don't have a W&B account.
#!wandb login

We modified `batchsize`, `pad`, and `epochs` to train on our handwritten data.

In [None]:
# generate colab specific config (set 'debug' to true if wandb is not used)
!echo {backbone_layers: [2, 3, 7], betas: [0.9, 0.999], batchsize: 15, bos_token: 1, channels: 1, data: dataset/data/train.pkl, debug: true, decoder_args: {'attn_on_attn': true, 'cross_attend': true, 'ff_glu': true, 'rel_pos_bias': false, 'use_scalenorm': false}, dim: 256, encoder_depth: 4, eos_token: 2, epochs: 5, gamma: 0.9995, heads: 8, id: null, load_chkpt: 'weights.pth', lr: 0.001, lr_step: 30, max_height: 192, max_seq_len: 512, max_width: 672, min_height: 32, min_width: 32, model_path: checkpoints, name: mixed, num_layers: 4, num_tokens: 8000, optimizer: Adam, output_path: outputs, pad: true, pad_token: 0, patch_size: 16, sample_freq: 2000, save_freq: 1, scheduler: StepLR, seed: 42, temperature: 0.2, test_samples: 5, testbatchsize: 20, tokenizer: dataset/tokenizer.json, valbatches: 100, valdata: dataset/data/val.pkl} > colab.yaml
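
The config selects `scheduler: StepLR` with `lr: 0.001`, `gamma: 0.9995` and `lr_step: 30`. As a rough sketch of what those numbers imply, the snippet below evaluates the closed-form StepLR rule (multiply the learning rate by `gamma` once every `step_size` scheduler steps). It assumes that `lr_step` is used as the scheduler's step size; whether the training loop steps the scheduler per batch or per epoch is not stated in this notebook, so `n_steps` simply counts scheduler steps. The helper `step_lr` is a hypothetical name.

```python
def step_lr(base_lr, gamma, step_size, n_steps):
    """Learning rate after n_steps scheduler steps under a StepLR-style rule."""
    return base_lr * gamma ** (n_steps // step_size)


# Values taken from the config: lr=0.001, gamma=0.9995, lr_step=30.
for n in (0, 30, 3000, 30000):
    print(n, step_lr(0.001, 0.9995, 30, n))
```

With a `gamma` this close to 1 the decay is very gentle, so the learning rate only drops noticeably after many scheduler steps.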

In [None]:
!python -m pix2tex.train --config colab.yaml