<a href="https://colab.research.google.com/github/6uan/LaTeX-OCR-Testing/blob/main/notebooks/LaTeX_OCR_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Train a LaTeX OCR model
In this brief notebook I show how you can fine-tune or train a LaTeX OCR model.

I've opted to mix handwritten data into the regular PDF LaTeX images. To do that, I started from the released pretrained model and continued training on the slightly larger combined corpus.

In [1]:
!pip install pix2tex[train] -qq


In [2]:
import os
!mkdir -p LaTeX-OCR
os.chdir('LaTeX-OCR')

In [3]:
!pip install gpustat -q
!pip install opencv-python-headless==4.8.1.78 -U -q
!pip install --upgrade --no-cache-dir gdown -q



In [5]:
import os
import shutil
import random
from glob import glob

# Create the necessary directories if they don't exist
!mkdir -p dataset/data
!mkdir -p dataset/data/images  # Ensure this directory is correctly referenced

# Download the datasets using gdown
# (gdown >= 4.3.1 takes the file ID directly; the --id flag is deprecated)
!gdown -O dataset/data/HANDWRTN.zip 18-bFQV5m1Ir1pq8deQiAkw743KpBSBN3         # NEW DATA
!gdown -O dataset/data/pdf.zip 176PKaCUDWmTJdQwc-OfkO0y8t4gLsIvQ              # PDF IMAGES DATA
!gdown -O dataset/data/pdfmath.txt 1QUjX6PFWPa-HBWdcY-7bA5TRVUnbyS1D          # PDF MATH DATA

# Unzip the downloaded datasets
!unzip -q dataset/data/HANDWRTN.zip -d dataset/data
!unzip -q dataset/data/pdf.zip -d dataset/data

# Define the number of validation images you want
number_of_val_images = 1000

# Paths to the training images (after unzipping) and to the held-out validation images
images_path = 'dataset/data/images'
val_images_path = 'dataset/data/valimages'

handwrtn_txt = '/content/LaTeX-OCR/dataset/data/HANDWRTN/HANDWRTN_math.txt'

# Check if the 'images' directory exists and has files
if os.path.exists(images_path) and os.path.isdir(images_path):
    # Get all the image files in the 'images' directory
    all_images = glob(os.path.join(images_path, '*'))

    # Shuffle the list of images
    random.shuffle(all_images)

    # Ensure the validation directory exists
    os.makedirs(val_images_path, exist_ok=True)

    # Move a subset of images to the 'valimages' directory
    for img in all_images[:number_of_val_images]:
        shutil.move(img, val_images_path)

    # The remaining files in 'images' are your training set
else:
    print(f"The directory {images_path} does not exist or is not a directory.")


Downloading...
From (uriginal): https://drive.google.com/uc?id=18-bFQV5m1Ir1pq8deQiAkw743KpBSBN3
From (redirected): https://drive.google.com/uc?id=18-bFQV5m1Ir1pq8deQiAkw743KpBSBN3&confirm=t&uuid=b0470252-7722-4b5f-9d3f-922d90d20a99
To: /content/LaTeX-OCR/dataset/data/HANDWRTN.zip
100% 344M/344M [00:01<00:00, 199MB/s]
Downloading...
From (uriginal): https://drive.google.com/uc?id=176PKaCUDWmTJdQwc-OfkO0y8t4gLsIvQ
From (redirected): https://drive.google.com/uc?id=176PKaCUDWmTJdQwc-OfkO0y8t4gLsIvQ&confirm=t&uuid=2bdc06a4-8fa2-41ac-a812-f84aec659606
To: /content/LaTeX-OCR/dataset/data/pdf.zip
100% 284M/284M [00:01<00:00, 216MB/s]
Downloading...
From: https://drive.google.com/uc?id=1QUjX6PFWPa-HBWdcY-7bA5TRVUnbyS1D
To: /content/LaTeX-OCR/dataset/data/pdfmath.txt
100% 36.6M/36.6M [00:00<00:00, 207MB/s]
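
The split in the cell above shuffles without a fixed seed, so a rerun moves a different set of images into `valimages`. As a minimal sketch, assuming you want the same split on every run, the selection can be driven by a seeded generator. The helper name `split_train_val` and the example file names are hypothetical, and this version only returns the two lists rather than moving any files.

```python
import random

def split_train_val(paths, n_val, seed=42):
    """Return (train, val) lists; the same seed always yields the same split."""
    shuffled = sorted(paths)  # sort first so the input order does not matter
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    return shuffled[n_val:], shuffled[:n_val]

# Hypothetical file names, used only to illustrate the split
paths = [f"dataset/data/images/{i:07d}.png" for i in range(10)]
train, val = split_train_val(paths, n_val=3)
print(len(train), len(val))
```

The files in `val` could then be passed to `shutil.move` exactly as in the loop above.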


Now we generate the datasets. We can string multiple datasets together to get one large lookup table. The only things saved in these pkl files are the image sizes, the image locations, and the ground-truth LaTeX code. That way we can serve batches of images with the same dimensions.
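
To illustrate why storing the image sizes is enough to build same-sized batches, here is a minimal sketch of the idea. It is not the pix2tex implementation: the record layout `(path, width, height, latex)` and the helper name `bucket_by_size` are assumptions made for this example.

```python
from collections import defaultdict

def bucket_by_size(records, batch_size):
    """Group (path, width, height, latex) records by image size, then cut batches."""
    buckets = defaultdict(list)
    for path, width, height, latex in records:
        buckets[(width, height)].append((path, latex))
    batches = []
    for size, items in buckets.items():
        # Every batch only contains images that share the same (width, height)
        for start in range(0, len(items), batch_size):
            batches.append((size, items[start:start + batch_size]))
    return batches

# Hypothetical records, used only to illustrate the grouping
records = [
    ("img/0.png", 128, 32, r"x^2"),
    ("img/1.png", 128, 32, r"\frac{a}{b}"),
    ("img/2.png", 256, 64, r"\int_0^1 f(x)\,dx"),
    ("img/3.png", 128, 32, r"\alpha+\beta"),
]
batches = bucket_by_size(records, batch_size=2)
for size, items in batches:
    print(size, [path for path, _ in items])
```

Because every image in a batch has the same size, the batch can be stacked into a single tensor without per-image padding.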

Note: the two dataset-generation cells below need the complete (absolute) path to the handwritten equations file to work.

In [6]:
!pwd

/content/LaTeX-OCR


In [7]:
!python -m pix2tex.dataset.dataset -i dataset/data/images dataset/data/train -e /content/LaTeX-OCR/dataset/data/HANDWRTN_math.txt dataset/data/pdfmath.txt -o dataset/data/train.pkl


Generate dataset
0it [00:00, ?it/s]
100% 158480/158480 [00:02<00:00, 62678.98it/s]


In [8]:
!python -m pix2tex.dataset.dataset -i dataset/data/valimages dataset/data/val -e /content/LaTeX-OCR/dataset/data/HANDWRTN_math.txt dataset/data/pdfmath.txt -o dataset/data/val.pkl

Generate dataset
100% 1000/1000 [00:00<00:00, 66994.17it/s]
100% 6765/6765 [00:00<00:00, 58156.08it/s]


In [9]:
# download the weights we want to fine-tune
!curl -L -o weights.pth https://github.com/lukas-blecher/LaTeX-OCR/releases/download/v0.0.1/weights.pth

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 97.3M  100 97.3M    0     0  93.8M      0  0:00:01  0:00:01 --:--:--  134M


In [10]:
# If using wandb
!pip install -q wandb
# you can skip this if you don't want to use it or don't have a W&B account.
#!wandb login


In [11]:
# generate a Colab-specific config (set 'debug' to true if wandb is not used)
!echo {backbone_layers: [2, 3, 7], betas: [0.9, 0.999], batchsize: 15, bos_token: 1, channels: 1, data: dataset/data/train.pkl, debug: true, decoder_args: {'attn_on_attn': true, 'cross_attend': true, 'ff_glu': true, 'rel_pos_bias': false, 'use_scalenorm': false}, dim: 256, encoder_depth: 4, eos_token: 2, epochs: 5, gamma: 0.9995, heads: 8, id: null, load_chkpt: 'weights.pth', lr: 0.001, lr_step: 30, max_height: 192, max_seq_len: 512, max_width: 672, min_height: 32, min_width: 32, model_path: checkpoints, name: mixed, num_layers: 4, num_tokens: 8000, optimizer: Adam, output_path: outputs, pad: true, pad_token: 0, patch_size: 16, sample_freq: 2000, save_freq: 1, scheduler: StepLR, seed: 42, temperature: 0.2, test_samples: 5, testbatchsize: 20, tokenizer: dataset/tokenizer.json, valbatches: 100, valdata: dataset/data/val.pkl} > colab.yaml

{backbone_layers: [2, 3, 7],
 betas: [0.9, 0.999],
 batchsize: 10,
 bos_token: 1,
 channels: 1,
 data: dataset/data/train.pkl,
 debug: true,
 decoder_args: {'attn_on_attn': true, 'cross_attend': true, 'ff_glu': true, 'rel_pos_bias': false, 'use_scalenorm': false},
 dim: 256,
 encoder_depth: 4,
 eos_token: 2,
 epochs: 50,
 gamma: 0.9995,
 heads: 8,
 id: null,
 load_chkpt: 'weights.pth',
 lr: 0.001,
 lr_step: 30,
 max_height: 192,
 max_seq_len: 512,
 max_width: 672,
 min_height: 32,
 min_width: 32,
 model_path: checkpoints,
 name: mixed,
 num_layers: 4,
 num_tokens: 8000,
 optimizer: Adam,
 output_path: outputs,
 pad: false,
 pad_token: 0,
 patch_size: 16,
 sample_freq: 2000,
 save_freq: 1,
 scheduler: StepLR,
 seed: 42,
 temperature: 0.2,
 test_samples: 5,
 testbatchsize: 20,
 tokenizer: dataset/tokenizer.json,
 valbatches: 100,
 valdata: dataset/data/val.pkl}
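
To get a feel for what `lr: 0.001`, `gamma: 0.9995` and `lr_step: 30` imply together, here is a small sketch of a step decay. It assumes the scheduler behaves like PyTorch's `StepLR` with `step_size` set to `lr_step`, which multiplies the learning rate by `gamma` once every `step_size` scheduler steps. Whether a scheduler step here corresponds to a batch or an epoch is not stated in this notebook, so `n_steps` is simply a count of scheduler steps. The helper name `step_lr` is hypothetical.

```python
def step_lr(base_lr, gamma, step_size, n_steps):
    """Learning rate after n_steps scheduler steps under a StepLR-style decay."""
    return base_lr * gamma ** (n_steps // step_size)

base_lr, gamma, lr_step = 0.001, 0.9995, 30  # values taken from the config above
for n in (0, 30, 3000, 30000):
    print(n, step_lr(base_lr, gamma, lr_step, n))
```

With a `gamma` this close to 1 the decay is slow: the rate is unchanged for the first 29 steps and drops by 0.05% at step 30.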

In [12]:
!python -m pix2tex.train --config colab.yaml

[34m[1mwandb[0m: (1) Create a W&B account
[34m[1mwandb[0m: (2) Use an existing W&B account
[34m[1mwandb[0m: (3) Don't visualize my results
[34m[1mwandb[0m: Enter your choice: 2
[34m[1mwandb[0m: You chose 'Use an existing W&B account'
[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Tracking run with wandb version 0.16.0
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/content/LaTeX-OCR/wandb/run-20231204_205605-ouump11v[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mmixed[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/6ua

In [13]:
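# Calling `python eval.py` fails here with "No such file or directory" because eval.py
# is not in the working directory; run the evaluation module from the installed package instead.
# Replace the placeholder paths with your own test set, checkpoint, and config.
!python -m pix2tex.eval -d path/to/crohme_test.pkl -c path/to/latest/checkpoint --config path/to/config.yaml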
