TextDirectory

TextDirectory allows you to combine multiple text files into one aggregated file. TextDirectory also supports matching files for certain criteria and applying transformations to the aggregated text.

TextDirectory can be used both as a standalone command-line tool and as a Python library.

Of course, everything TextDirectory does could also be achieved in bash or PowerShell. However, there are use cases (e.g. when it is used as a Python library) in which it comes in handy.

Features

  • Aggregating multiple text files
  • Filtering documents/texts by various criteria, such as length and content, as well as random sampling
  • Transforming the aggregated text (e.g. transforming the text to lowercase)

The available filters and transformations, listed by the version in which they were introduced:

Version 0.1.0
  Filters: filter_by_max_chars(n int); filter_by_min_chars(n int); filter_by_max_tokens(n int); filter_by_min_tokens(n int); filter_by_contains(str); filter_by_not_contains(str); filter_by_random_sampling(n int; replace=False)
  Transformations: transformation_lowercase

Version 0.1.1
  Filters: filter_by_chars_outliers(n sigmas int)
  Transformations: transformation_remove_nl

Version 0.1.2
  Filters: filter_by_filename_contains(str)
  Transformations: transformation_usas_en_semtag; transformation_uppercase; transformation_postag(spaCy model)

Version 0.1.3
  Filters: filter_by_similar_documents(reference_file str; threshold float)
  Transformations: transformation_remove_non_ascii; transformation_remove_non_alphanumerical

Version 0.2.0
  Filters: filter_by_max_filesize(max_kb int); filter_by_min_filesize(min_kb int)
  Transformations: transformation_to_leetspeak; transformation_crude_spellchecker(language model str)

Quickstart

Install TextDirectory via pip: pip install textdirectory

TextDirectory, as exemplified below, works with a two-stage model. After loading your data (a directory of text files), you iteratively filter down to the files you want to process. In a second step, you stage transformations that are applied to the text before it is finally aggregated.

As a Command-Line Tool

TextDirectory comes equipped with a CLI.

The syntax for both the filters and transformations works the same way: steps are chained with slashes (/), and parameters are passed after commas (,): filter_by_min_tokens,5/filter_by_random_sampling,2.
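For instance, a chain combining a length filter with a content filter from the table above might look like this (the parameter values are arbitrary illustrations, not taken from the documentation):

filter_by_min_chars,100/filter_by_not_contains,lorem

This would keep only files that are at least 100 characters long and do not contain the string lorem.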

Example 1: A Very Simple Aggregation

textdirectory --directory testdata --output_file aggregated.txt

This will take all files (.txt) in testdata and aggregate them into a file called aggregated.txt.

Example 2: Applying Filters and Transformations

In this example we want to filter the files based on their token count, perform a random sampling and finally transform all text to lowercase.

textdirectory --directory testdata --output_file aggregated.txt --filters filter_by_min_tokens,5/filter_by_random_sampling,2 --transformations transformation_lowercase

After applying two filters (filter_by_min_tokens and filter_by_random_sampling), we've applied the transformation_lowercase transformation.

The resulting file will contain the content of two randomly sampled files that each have at least five tokens.

As a Python Library

In order to demonstrate TextDirectory as a Python library, we'll recreate the second example from above:

import textdirectory

# Stage 1: load the files and filter down to the ones we want.
td = textdirectory.TextDirectory(directory='testdata')
td.load_files(recursive=False, filetype='txt', sort=True)
td.filter_by_min_tokens(5)
td.filter_by_random_sampling(2)

# Stage 2: stage transformations, then aggregate.
td.stage_transformation(['transformation_lowercase'])
td.aggregate_to_file('aggregated.txt')

If we wanted to keep working with the actual aggregated text, we could have called text = td.aggregate_to_memory().
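A minimal sketch of that in-memory variant (assuming the same td object as above):

# Aggregate into a string instead of writing to disk.
text = td.aggregate_to_memory()
print(text[:100])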

ToDo

  • Increasing test coverage
  • Writing better documentation
  • Adding better error handling (raw exceptions are, well ...)
  • Adding logging
  • Implementing autodoc (via Sphinx)

Behaviour

We are not holding the actual texts in memory. This leads to considerably more disk reads (and slower runtimes) but saves memory.

transformation_usas_en_semtag relies on the web version of Paul Rayson's USAS Tagger. Don't use this transformation for large amounts of text, give credit, and consider using their commercial product Wmatrix.

Credits

This package is based on the audreyr/cookiecutter-pypackage cookiecutter template. The crude spellchecker transformation (transformation_crude_spellchecker) is implemented following Peter Norvig's excellent tutorial.