Building Blocks

Below is a list of the corpus tools we use at Text Corpus Labs. They are intended to be general-purpose building blocks that allow for conversion between our different processes.

NOTE: This project is currently undergoing a retrofit. The checklist below now shows conversion status. While the retrofit is in progress, the old code will still work; it is just nested in a subfolder.

Operation

Install

You can install the package using the following steps:

pip install using an admin prompt.

```
pip uninstall buildingblocks
python -OO -m pip install -v git+https://github.com/TextCorpusLabs/building-blocks.git
```

Run

You can run the package in the following ways:

Extract

Pulls fields from every JSON object in a JSONL file into a CSV file.
```
buildingblocks extract jsonl_to_csv `
   -source d:/data/corpus `
   -dest d:/data/corpus.csv
```
The following are optional parameters:
- fields are the names of the fields to extract. It defaults to "id".
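
To make the extraction step concrete, here is a minimal Python sketch of the idea. It is not the package's actual implementation: the function names `read_jsonl` and `jsonl_to_csv` are hypothetical, and writing a header row and leaving missing fields as empty cells are assumptions.

```python
import csv
import json
from typing import Dict, Iterable, Iterator, Sequence


def read_jsonl(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-blank line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def jsonl_to_csv(source: str, dest: str, fields: Sequence[str] = ("id",)) -> int:
    """Copy the selected fields of every object into a CSV file.

    Returns the number of data rows written.
    """
    count = 0
    with open(source, encoding="utf-8") as src, \
            open(dest, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)
        # Assumption: the first row holds the field names.
        writer.writerow(fields)
        for obj in read_jsonl(src):
            # Assumption: a missing field becomes an empty cell.
            writer.writerow([obj.get(name, "") for name in fields])
            count += 1
    return count
```

Streaming line by line keeps memory use flat regardless of corpus size, which is the main reason to prefer JSONL over one large JSON array for this kind of work.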

Transform

Counts the n-grams in a JSONL file.
```
buildingblocks transform ngram `
   -source d:/data/corpus `
   -dest d:/data/corpus.ngrams.csv
```
The following are optional parameters:
- fields are the names of the fields to process. It defaults to "text".
- size is the length of the n-gram. It defaults to 1.
- top is the number of n-grams to save. It defaults to 10K.
- chunk controls the number of n-grams to chunk to disk to prevent OOM. Higher values use more RAM, but compute the overall value faster. It defaults to 10M.
- keep_case (flag) keeps the casing of fields as-is before converting to tokens for counting.
- keep_punct (flag) keeps all punctuation of fields as-is before converting to tokens for counting.
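
The effect of size, top, keep_case, and keep_punct can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the package's code: `tokenize` and `count_ngrams` are hypothetical names, tokens are split on whitespace, punctuation is taken to mean any character that is neither a word character nor whitespace, and everything is counted in memory (the chunk-to-disk step is left out).

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def tokenize(text: str, keep_case: bool = False, keep_punct: bool = False) -> List[str]:
    """Split text into whitespace-delimited tokens."""
    if not keep_case:
        text = text.lower()
    if not keep_punct:
        # Assumption: punctuation is anything that is not a word
        # character or whitespace; it is replaced by a space.
        text = re.sub(r"[^\w\s]", " ", text)
    return text.split()


def count_ngrams(
    texts: Iterable[str],
    size: int = 1,
    top: int = 10_000,
    keep_case: bool = False,
    keep_punct: bool = False,
) -> List[Tuple[str, int]]:
    """Return the `top` most frequent n-grams of length `size`."""
    counts: Counter = Counter()
    for text in texts:
        tokens = tokenize(text, keep_case, keep_punct)
        # Slide a window of `size` tokens across the text.
        for i in range(len(tokens) - size + 1):
            counts[" ".join(tokens[i:i + size])] += 1
    return counts.most_common(top)
```

For example, `count_ngrams(["The cat. The dog!"], size=1, top=1)` returns `[("the", 2)]`, while passing `keep_case=True` returns `[("The", 2)]`.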

TODO

All script commands are presented in PowerShell syntax. If you use a different shell, your syntax will be different.

Adding -O to the front of any script runs it in "optimized" mode. This can give a performance boost of as much as 50% in some cases, but makes errors harder to interpret. If there is an error in a run, remove the -O, capture the error, and submit an issue.

Combine

- Combine a folder of JSON files into a single JSONL file.
- Combine a folder of TXT files into a single JSONL file.

Convert

- Convert a JSONL file into a smaller JSONL file by keeping only some elements.
- Convert a folder of TXT files into a folder of bigger TXT files.
- Convert a JSONL file into a JSONT file.
- Convert a JSONT file into a JSONL file.

Extract

- Extract a folder of interleaved TXT files from a JSONL file.
- Extract a folder of JSON files from a JSONL file.
- Extract a folder of TXT files from a JSONL file.

Merge

- Merge several folders of JSON files into a single folder of JSON files based on their file name.
- Merge several folders of TXT files into a single folder of TXT files based on their file name.

Transform

- Tokenize a JSONL file using the NLTK defaults (Punkt + Penn Treebank).

Development

Use the instructions below to set up the module for local development.

Clone this repository, then open an Admin shell to the ~/ directory.

Install the required modules.

```
pip uninstall buildingblocks
pip install -e c:/repos/TextCorpusLabs/building-blocks
```

Set up the ~/.vscode/launch.json file (VS Code only):
1. Click the "Run and Debug" icon in the Activity Bar
2. Click the "create a launch.json file" link
3. Select "Python"
4. Select "module" and enter buildingblocks
5. Select one of the following modes and add the args below to the launch.json file. The args node should be a sibling of the module node. You will need to adjust the paths and arguments for your setup. The first two arguments determine the command; the remaining arguments are the command's parameters.
```
"args" : [
   "extract", "jsonl_to_csv",
   "-source", "d:/data/corpus",
   "-dest", "d:/data/corpus.csv",
   "-fields", "id,text"]
```
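
To illustrate how the first two arguments select a command and the rest become its parameters, here is a small Python sketch using the standard library's argparse with nested subparsers. This is an assumption about how such a command line could be parsed, not the package's actual parser; `build_parser` and the `group`/`command` attribute names are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a two-level command parser: <group> <command> [parameters]."""
    parser = argparse.ArgumentParser(prog="buildingblocks")
    # First positional argument: the command group (e.g. "extract").
    groups = parser.add_subparsers(dest="group", required=True)
    extract = groups.add_parser("extract")
    # Second positional argument: the command within the group.
    commands = extract.add_subparsers(dest="command", required=True)
    j2c = commands.add_parser("jsonl_to_csv")
    # Remaining arguments: the command's own parameters.
    j2c.add_argument("-source", required=True)
    j2c.add_argument("-dest", required=True)
    j2c.add_argument("-fields", default="id")
    return parser


# The same list that appears in the launch.json "args" node.
args = build_parser().parse_args([
    "extract", "jsonl_to_csv",
    "-source", "d:/data/corpus",
    "-dest", "d:/data/corpus.csv",
    "-fields", "id,text",
])
print(args.group, args.command, args.fields.split(","))
```

Because the debugger passes the args list straight through as the process arguments, any change to the command or its parameters only needs to be made in launch.json.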
