TXT to TXT

Convert a folder of TXT files into a folder of bigger TXT files.

Each source file is written into the output as '--- {source file stem} --- \n\n{source file body}'. The source file's body retains its line breaks.
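As a rough illustration of that layout, the sketch below builds one entry from a file stem and body. The function name `format_entry` is hypothetical and not taken from `convert_txt.py`; only the template string comes from the description above.

```python
def format_entry(stem: str, body: str) -> str:
    """Build one stacked entry (hypothetical helper, for illustration only).

    The header is the source file's stem wrapped in '--- ... --- ',
    followed by a blank line and the unmodified body.
    """
    return f"--- {stem} --- \n\n{body}"


if __name__ == "__main__":
    # The body's own line breaks are kept as-is.
    print(format_entry("chapter_01", "first line\nsecond line"))
```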

Steps

Below are the steps needed to run the conversion process. The input and output paths can be changed through the parameters.

  1. Clone this repository.
  2. Open a PowerShell window to the ~/src directory.
  3. Convert a folder of TXT files into a folder of bigger TXT files.
    • The -in/-out parameters control the source and destination folders. If the output folder does not exist it is created. WARNING: If the output folder does exist AND is not empty, new TXT files will overwrite old ones.
    • The -s parameter controls the output file name's stem, i.e. f'./{stem}.{count}.txt'. It defaults to 'stacked'.
    • The -l parameter controls approximately how many lines go into each new file. It defaults to 100k.
    • The optional -spc parameter allows for tuning on multi-core machines. It defaults to 1.
    python convert_txt.py -in d:/corpus_in -out d:/corpus_out
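
To make the parameters above more concrete, here is a minimal sketch of how such a stacking pass might work. It is not the repository's `convert_txt.py`: the function name `stack_txt`, the zero-based `count`, the `*.txt` glob, and the rule "start a new output file once the running line count reaches the limit" are all assumptions chosen to match the description (output folder created if missing, existing files overwritten, line limit only approximate because a source file is never split). The -spc parameter is not modelled.

```python
from pathlib import Path


def stack_txt(in_dir, out_dir, stem="stacked", max_lines=100_000):
    """Hypothetical sketch of the stacking step; returns the files written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the output folder if missing

    written = []
    buffer, line_count, count = [], 0, 0  # assumption: count starts at 0

    def flush():
        nonlocal buffer, line_count, count
        if not buffer:
            return
        target = out / f"{stem}.{count}.txt"
        # write_text overwrites any existing file with the same name
        target.write_text("\n\n".join(buffer), encoding="utf-8")
        written.append(target)
        buffer, line_count, count = [], 0, count + 1

    for src in sorted(Path(in_dir).glob("*.txt")):
        body = src.read_text(encoding="utf-8")
        buffer.append(f"--- {src.stem} --- \n\n{body}")
        line_count += body.count("\n") + 1
        # Source files are never split, so the limit is only approximate.
        if line_count >= max_lines:
            flush()

    flush()  # write whatever is left over
    return written
```

Called as `stack_txt("d:/corpus_in", "d:/corpus_out")`, this sketch would mirror the command above with the default stem and line limit.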