ocr_errors_simulator

Functions and scripts for estimating OCR error probabilities and simulating OCR errors in clean text.

To build the charset, first run JSONL_reading.py to preprocess the ECCO file; it splits the compressed data into chunks stored in separate files. Then run charset.py to write the charset to a text file. To build the JSONL file of probabilities, run CSV_convert.py to convert the CSV files into the format expected by OCR_errors_JSON_generator.py, then run that script. Once both the charset file and the probabilities file exist, run OCR_noise.py to add OCR noise.
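As a rough illustration of what building a charset involves, here is a minimal sketch that collects the distinct characters found in JSONL records. It is not the code in charset.py: the function names `build_charset` and `charset_string` and the `"text"` field name are assumptions made for this example, and the actual layout of the ECCO records may differ.

```python
import json
from collections import Counter


def build_charset(jsonl_lines, text_key="text"):
    """Count every character seen in the given JSONL lines.

    `text_key` is a hypothetical field name; adjust it to the real data.
    """
    counts = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts.update(record.get(text_key, ""))
    return counts


def charset_string(counts):
    """Return the distinct characters as one sorted string."""
    return "".join(sorted(counts))


if __name__ == "__main__":
    sample = ['{"text": "abba"}', '{"text": "cab"}']
    print(charset_string(build_charset(sample)))
```

Keeping the counts, and not only the set of characters, makes it possible to drop very rare characters later if that turns out to be useful.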

To use OCR_noise.py, please use the following command:

python3 OCR_noise.py {--seed 0} --charset data_files/ecco_charset.txt --charset-probs data_files/ecco_i/ecco_i_probs.jsonl {clean/texts/path.jsonl} > {noised/texts/path.jsonl}
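To give an idea of what seeded noise injection can look like, here is a minimal sketch of character-level substitution driven by a probability table. It is not the implementation in OCR_noise.py: the function name `add_noise` and the table layout (a character mapped to its possible replacements and their probabilities) are assumptions for illustration, and the format of the real `.jsonl` probability files may differ.

```python
import random


def add_noise(text, probs, rng):
    """Replace characters of `text` according to `probs`.

    `probs` is a hypothetical table: {char: {replacement: probability}}.
    Probability mass not assigned to any replacement keeps the character.
    An empty-string replacement models a deleted character.
    """
    out = []
    for ch in text:
        table = probs.get(ch)
        if not table:
            out.append(ch)  # no known confusions for this character
            continue
        r = rng.random()
        cumulative = 0.0
        replacement = ch
        for candidate, p in table.items():
            cumulative += p
            if r < cumulative:
                replacement = candidate
                break
        out.append(replacement)
    return "".join(out)


if __name__ == "__main__":
    # Toy table: 'e' is sometimes read as 'c', 'l' as '1' or dropped.
    toy_probs = {"e": {"c": 0.2}, "l": {"1": 0.1, "": 0.05}}
    rng = random.Random(0)  # fixed seed, in the spirit of --seed 0
    print(add_noise("hello level", toy_probs, rng))
```

Passing an explicit `random.Random` instance, instead of using the module-level functions, keeps runs with the same seed reproducible.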


TurkuNLP/ocr_errors_simulator
