Building blocks for text pre-processing

Building Blocks

Python MIT license Last Updated

Below is a list of the corpus tools we use at Text Corpus Labs. They are intended to be general-purpose building blocks that convert data between our different processes.

NOTE: This project is currently undergoing a retrofit. The checklist below shows the conversion status. While the retrofit is in progress, the old code still works; it is just nested in a subfolder.

Operation

Install

You can install the package using the following steps:

  1. pip install using an admin prompt.
    pip uninstall buildingblocks
    python -OO -m pip install -v git+https://github.com/TextCorpusLabs/building-blocks.git
    

Run

You can run the package in the following ways:

Extract

  1. Pull fields from every JSON object in a JSONL file into a CSV file
    buildingblocks extract jsonl_to_csv `
       -source d:/data/corpus `
       -dest d:/data/corpus.csv
    
    The following are optional parameters
    • fields is a comma-separated list of the field names to extract. It defaults to "id".
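
As a rough illustration of what this operation does, the following Python sketch reads a JSONL file line by line and writes the chosen fields to a CSV file. It is not the package's implementation: the function name `jsonl_to_csv`, its signature, the header row, and the choice to write an empty cell for a missing field are all assumptions made for the example.

```python
import csv
import json
from pathlib import Path


def jsonl_to_csv(source: Path, dest: Path, fields: list[str]) -> int:
    """Write the chosen fields of every JSON object in `source` to `dest`.

    Hypothetical helper for illustration only. Returns the number of data
    rows written. A field missing from an object becomes an empty cell.
    """
    rows = 0
    with source.open("r", encoding="utf-8") as fin, \
         dest.open("w", encoding="utf-8", newline="") as fout:
        writer = csv.writer(fout)
        writer.writerow(fields)  # header row (an assumption about the layout)
        for line in fin:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between objects
            obj = json.loads(line)
            writer.writerow([obj.get(name, "") for name in fields])
            rows += 1
    return rows
```

Calling `jsonl_to_csv(Path("corpus.jsonl"), Path("corpus.csv"), ["id", "text"])` would mirror passing `-fields id,text` on the command line, under the assumptions above.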

Transform

  1. Count the n-grams in a JSONL file.
    buildingblocks transform ngram `
       -source d:/data/corpus `
       -dest d:/data/corpus.ngrams.csv
    
    The following are optional parameters
    • fields are the names of the fields to process. It defaults to "text"
    • size is the length of the n-gram. It defaults to 1
    • top is the number of n-grams to save. It defaults to 10K
    • chunk controls the number of n-grams held in memory before they are written to disk, which prevents out-of-memory (OOM) errors. Higher values use more RAM but compute the overall result faster. It defaults to 10M.
    • keep_case (flag) keeps the casing of fields as-is before converting to tokens for counting.
    • keep_punct (flag) keeps all punctuation of fields as-is before converting to tokens for counting.
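
To make the size, top, keep_case, and keep_punct parameters concrete, here is a minimal in-memory sketch of n-gram counting in Python. It is an illustration under stated assumptions, not the package's code: the function name `count_ngrams` is hypothetical, tokens are produced by splitting on whitespace, punctuation means the ASCII set in `string.punctuation`, and the chunk-to-disk behavior is left out entirely.

```python
import string
from collections import Counter
from typing import Iterable


def count_ngrams(texts: Iterable[str], size: int = 1, top: int = 10_000,
                 keep_case: bool = False,
                 keep_punct: bool = False) -> list[tuple[str, int]]:
    """Return the `top` most frequent n-grams of length `size`.

    Hypothetical helper for illustration only. Everything is counted in
    memory, so there is no equivalent of the chunk parameter here.
    """
    strip_punct = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for text in texts:
        if not keep_case:
            text = text.lower()
        if not keep_punct:
            text = text.translate(strip_punct)  # drop ASCII punctuation
        tokens = text.split()  # whitespace tokenization (an assumption)
        # Slide a window of `size` tokens across the text.
        for i in range(len(tokens) - size + 1):
            counts[" ".join(tokens[i:i + size])] += 1
    return counts.most_common(top)
```

With the defaults, "The" and "the" are counted as the same unigram and "cat." is counted as "cat"; setting keep_case and keep_punct keeps them distinct.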

TODO

All script commands are presented in PowerShell syntax. If you use a different shell, your syntax will be different.

Adding -O to the front of any script runs it in "optimized" mode. This can give as much as a 50% speed boost in some cases, but it makes error messages harder to interpret. If a run fails, remove the -O, rerun to capture the full error, and submit an issue.
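
One documented effect of Python's -O flag is that assert statements are removed, so a sanity check that would normally stop a run with a clear message is silently skipped. The sketch below demonstrates this by running the same one-line snippet with and without -O. The helper name `run_snippet` and the snippet itself are made up for the demonstration, and whether this is the main cause of confusing errors in this package is an assumption.

```python
import subprocess
import sys

# A one-line program: the assert fails in normal mode and is removed under -O.
SNIPPET = "assert False, 'sanity check failed'; print('assert was skipped')"


def run_snippet(optimized: bool) -> subprocess.CompletedProcess:
    """Run SNIPPET in a fresh interpreter, optionally with the -O flag."""
    cmd = [sys.executable]
    if optimized:
        cmd.append("-O")
    cmd += ["-c", SNIPPET]
    return subprocess.run(cmd, capture_output=True, text=True)


if __name__ == "__main__":
    normal = run_snippet(optimized=False)
    fast = run_snippet(optimized=True)
    print("normal mode exit code:", normal.returncode)
    print("optimized mode exit code:", fast.returncode)
```

In normal mode the interpreter exits with an AssertionError; with -O the assert never runs and the program continues, which is why rerunning without -O gives a more useful report.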

Combine

    • Combine a folder of JSON files into a single JSONL file.
    • Combine a folder of TXT files into a single JSONL file.
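
For readers unfamiliar with the JSONL format, the second item can be sketched in a few lines of Python. This is only an illustration of the operation, not the package's command: the function name `combine_txt_to_jsonl` is hypothetical, and storing the file stem under "id" and the contents under "text" is an assumption based on the default field names used elsewhere in this document.

```python
import json
from pathlib import Path


def combine_txt_to_jsonl(folder: Path, dest: Path) -> int:
    """Write one JSON object per TXT file in `folder` to the JSONL file `dest`.

    Hypothetical helper for illustration only. Returns the number of
    files combined. Files are processed in sorted name order.
    """
    count = 0
    with dest.open("w", encoding="utf-8") as fout:
        for path in sorted(folder.glob("*.txt")):
            record = {
                "id": path.stem,                          # assumed field name
                "text": path.read_text(encoding="utf-8"),  # assumed field name
            }
            # One JSON object per line is what makes the file JSONL.
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```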

Convert

    • Convert a JSONL file into a smaller JSONL file by keeping only some elements.
    • Convert a folder of TXT files into a folder of bigger TXT files.
    • Convert a JSONL file into a JSONT file.
    • Convert a JSONT file into a JSONL file.

Extract

    • Extract a folder of interleaved TXT files from a JSONL file.
    • Extract a folder of JSON files from a JSONL file.
    • Extract a folder of TXT files from a JSONL file.

Merge

    • Merge several folders of JSON files into a single folder of JSON files based on their file name.
    • Merge several folders of TXT files into a single folder of TXT files based on their file name.

Transform

    • Tokenize a JSONL file using the NLTK defaults (Punkt + Penn Treebank).

Development

Use the instructions below to set up the module for local development.

  1. Clone this repository, then open an Admin shell to the ~/ directory.
  2. Install the required modules.
    pip uninstall buildingblocks
    pip install -e c:/repos/TextCorpusLabs/building-blocks
    
  3. Set up the ~/.vscode/launch.json file (VS Code only).
    1. Click the "Run and Debug Charm"
    2. Click the "create a launch.json file" link
    3. Select "Python"
    4. Select "module" and enter buildingblocks
    5. Select one of the following modes and add the args below to the launch.json file. The args node should be a sibling of the module node. You will need to change the paths and arguments to match your setup. The first two arguments select the command; the remaining arguments are the command's parameters.
      "args" : [
         "extract", "jsonl_to_csv",
         "-source", "d:/data/corpus",
         "-dest", "d:/data/corpus.csv",
         "-fields", "id,text"]