Celery-LaTex-OCR

Yet another LaTeX OCR project written in PyTorch, based on ConvNeXt and Transformer.

This project is the backend of CeleryMath.

Give us a star if this project helps you 🤗

Develop

Any further developments and contributions are welcome 😄

Training Instruction

Follow these steps to train the model yourself:

Create a virtual environment.

poetry install
poetry shell

Create dataset.

You can download the generated dataset from here (2.05G) or generate it yourself with the following command. Generation may be slow:

python -m src.utils.latex2png -i dataset/data/full_math.txt -w dataset/data/full_set -b 1

Edit config file.

Edit src/config/config_convnext.json and replace the dataset path with your own.
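If you prefer to script this edit, the following is a minimal sketch using only the standard library. The helper name `set_config_value` and the key `dataset_path` in the usage note are hypothetical; open the JSON file to find the key that actually holds the dataset path in this project.

```python
import json
from pathlib import Path


def set_config_value(config_path, key, value):
    """Load a JSON config, overwrite one existing top-level key, and save it.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    if key not in config:
        # Refuse to create new keys so that a misspelled key name is noticed.
        raise KeyError(f"{key!r} not found in {path}")
    config[key] = value
    path.write_text(
        json.dumps(config, indent=4, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return config
```

A call might look like `set_config_value("src/config/config_convnext.json", "dataset_path", "dataset/data/full_set")`, with `dataset_path` replaced by the real key name. Note that the file is rewritten with 4-space indentation, so its formatting may change.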

Run training

python -m src.train

Dataset Instruction

If you have your own LaTeX formula dataset, you can add it to dataset/data/full_math.txt and then regenerate the tokenizer and images.
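Adding your own formulas can be scripted. The sketch below assumes the formula file holds one formula per line (check dataset/data/full_math.txt to confirm); the helper name `append_formulas` and the file name `my_formulas.txt` in the usage note are hypothetical.

```python
from pathlib import Path


def append_formulas(new_file, target_file):
    """Append formulas from new_file to target_file.

    Assumes one formula per line. Blank lines and formulas already
    present in the target are skipped. Returns the number added.
    """
    target = Path(target_file)
    text = target.read_text(encoding="utf-8") if target.exists() else ""
    seen = {line.strip() for line in text.splitlines()}

    added = []
    for line in Path(new_file).read_text(encoding="utf-8").splitlines():
        formula = line.strip()
        if formula and formula not in seen:
            seen.add(formula)
            added.append(formula)

    if added:
        # Start on a fresh line if the target lacks a trailing newline.
        prefix = "" if (not text or text.endswith("\n")) else "\n"
        with target.open("a", encoding="utf-8") as out:
            out.write(prefix + "\n".join(added) + "\n")
    return len(added)
```

For example, `append_formulas("my_formulas.txt", "dataset/data/full_math.txt")` would merge a local file into the project's formula list before the tokenizer and images are regenerated.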

Regenerate the tokenizer. The Hugging Face tokenizers library is used; to change the formula file or the output file location, edit src/dataset.py.

python -m src.dataset

Generate the dataset. TeX Live, MiKTeX, or a similar TeX distribution must be installed.

python -m src.utils.latex2png -i dataset/data/full_math.txt -w dataset/data/full_set -b 1
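Because image generation depends on a TeX installation, it can help to check for one before starting a long run. This is a rough sketch; which engine `src.utils.latex2png` actually invokes is not stated here, so the candidate names below are assumptions, and `find_tex_engine` is a hypothetical helper.

```python
import shutil


def find_tex_engine(candidates=("pdflatex", "xelatex", "latex")):
    """Return the path of the first candidate found on PATH, or None.

    The default candidate names are assumptions; adjust them to match
    the engine that the generation script calls.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None
```

If `find_tex_engine()` returns `None`, install TeX Live or MiKTeX (or add it to your PATH) before running the generation command.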

Issues

Open an issue if you have any questions, or a PR if you can fix a problem.

TODO

API
Desktop Deploy, see CeleryMath
ONNX
Use pytorch-lightning to manage training and evaluation

Acknowledgement

This project was inspired by the following projects, and some methods and code were borrowed from them. THANKS A LOT!!! 🤝

LaTeX-OCR, LICENSE: MIT, https://github.com/lukas-blecher/LaTeX-OCR
LaTeX_OCR_PRO, LICENSE: GPL-3.0, https://github.com/LinXueyuanStdio/LaTeX_OCR_PRO

LICENSE

GPL-3.0, details here
