A small Python library to help clean TMX files, with a sample usage script

Numeri/cleantmx

cleantmx

A small Python library to build NLP data cleaning pipelines

This library is meant to be a lightweight but powerful tool to help create data cleaning pipelines for NLP. It's designed to be flexible, and you can easily extend it with your own code.

It comes with several built-in filters of different types. Most operate on an individual segment, but some use information from both the source and the target, for example to remove pairs whose source and target are identical or whose segment lengths are mismatched.
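To make the two kinds of filter concrete, here is a minimal sketch in plain Python. It does not use cleantmx's actual API, which this README does not document. Every name in it (`normalize_whitespace`, `drop_identical`, `make_length_ratio_filter`, `run_pipeline`) and the `max_ratio` threshold are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple

# NOTE: all names below are hypothetical; this is not the cleantmx API.
Pair = Tuple[str, str]
SegmentFilter = Callable[[str], str]
PairFilter = Callable[[str, str], Optional[Pair]]


def normalize_whitespace(segment: str) -> str:
    """Segment-level filter: collapse runs of whitespace into single spaces."""
    return " ".join(segment.split())


def drop_identical(source: str, target: str) -> Optional[Pair]:
    """Pair-level filter: reject pairs whose two sides are the same text."""
    if source.strip().lower() == target.strip().lower():
        return None
    return source, target


def make_length_ratio_filter(max_ratio: float = 3.0) -> PairFilter:
    """Build a pair-level filter that rejects badly mismatched lengths.

    The default threshold is an arbitrary illustrative value.
    """
    def length_ratio(source: str, target: str) -> Optional[Pair]:
        shorter, longer = sorted((len(source), len(target)))
        if shorter == 0 or longer / shorter > max_ratio:
            return None
        return source, target

    return length_ratio


def run_pipeline(
    pairs: Iterable[Pair],
    segment_filters: Iterable[SegmentFilter],
    pair_filters: Iterable[PairFilter],
) -> Iterator[Pair]:
    """Apply segment filters to each side, then pair filters to the pair."""
    segment_filters = list(segment_filters)
    pair_filters = list(pair_filters)
    for source, target in pairs:
        for segment_filter in segment_filters:
            source, target = segment_filter(source), segment_filter(target)
        pair: Optional[Pair] = (source, target)
        for pair_filter in pair_filters:
            pair = pair_filter(*pair)
            if pair is None:
                break  # a pair filter rejected this pair
        else:
            yield pair
```

In this sketch a pair filter returns `None` to drop a pair, so filters can be chained in any order and the first rejection stops further work on that pair.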

See examples/process_tmx.py for an example of reading in a .tmx file of English-Swedish pairs, cleaning it, then saving two segment-aligned text files.
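For orientation, the read-then-write workflow described above can be sketched with the standard library alone. This is not the contents of examples/process_tmx.py. The function names `read_tmx_pairs` and `write_aligned` are hypothetical, and the sketch assumes a conventional TMX layout (`tu` units containing `tuv` variants, each with a `seg`) and the language codes `en` and `sv`.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(
    tmx_path: str, source_lang: str = "en", target_lang: str = "sv"
) -> List[Tuple[str, str]]:
    """Hypothetical helper: collect (source, target) pairs from a TMX file."""
    tree = ET.parse(str(tmx_path))
    pairs = []
    for tu in tree.getroot().iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() keeps text around inline markup inside <seg>.
                text = "".join(seg.itertext()).strip()
                # Key by primary subtag, so "en-US" is treated as "en".
                segments[lang.split("-")[0]] = text
        # Skip units that lack either language.
        if source_lang in segments and target_lang in segments:
            pairs.append((segments[source_lang], segments[target_lang]))
    return pairs


def write_aligned(
    pairs: List[Tuple[str, str]], source_path: str, target_path: str
) -> None:
    """Write one segment per line so line N of both files is the same pair."""
    with open(source_path, "w", encoding="utf-8") as src, \
            open(target_path, "w", encoding="utf-8") as tgt:
        for source, target in pairs:
            # Embedded newlines would break the line-by-line alignment.
            src.write(source.replace("\n", " ") + "\n")
            tgt.write(target.replace("\n", " ") + "\n")
```

Any cleaning would run between the two calls, on the list that `read_tmx_pairs` returns.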
