User Guide / README
This "app" is designed to provide a template for preprocessing RF waveform label data (faulting cavity and fault type labels), and for performing feature extraction using tsfresh. A script manages processing raw label files (provided by T. Powers), into a directory of label files, one per RF fault event. Additional scripts manage the work of feature extraction based on the processed label files present in the labeled-examples/processed directory.
Feature extraction is performed in parallel (thanks to GNU parallel) across the nodes specified in the nodelist file. The feature extraction work is broken into two main pieces, the extraction needed for the cavity model and the extraction needed for the fault type (a.k.a., "trip") model, as represented by the two parallel_*.bash scripts in bin/. Each parallel_*.bash script ultimately calls python/extract.py on each file in labeled-examples/processed. These jobs are meant to be run by parallel, so tsfresh.extract_features is called with n_jobs=1 (i.e., no internal parallelization).
To use this template, simply make a copy of the app somewhere with sufficient storage and accessible by the hosts that will be running feature extraction code. See the setup section for additional details.
After setting up the rfw_tsf_extractor app as described in the Setup section, simply follow the steps in the Workflow section to perform the data processing and feature extraction.
| Path | Description |
|------|-------------|
| bin/ | For executable files |
| bin/process_raw_label_files.bash | Script for processing raw label files |
| bin/parallel_*_extraction.bash | Scripts for managing parallelized feature extractions |
| bin/do_*_extraction.bash | Scripts for extracting features from a single event |
| extracted/ | For files containing extracted features |
| labeled-examples/ | For files containing labeled examples |
| labeled-examples/raw/ | For unprocessed label files generated by SMEs |
| labeled-examples/processed/ | For processed label files (one file per event) |
| labeled-examples/event_reduction.log | Audit log of the processing of each event. Created by process_raw_label_files.bash |
| labeled-examples/process.log | Report information from processing labels in raw/ |
| labeled-examples/master.csv | A single CSV file containing all of the processed events/labels in one convenient place |
| log/ | Contains timestamped directories of log files from a feature extraction run |
| nodefile | A file that controls the remote nodes on which parallel will try to run jobs |
| python/ | Contains Python code for performing feature extraction |
| requirements.txt | pip requirements for creating the Python virtual environment |
| venv/ | Directory created to contain the Python virtual environment |
| waveform-data/ | Contains the harvested waveform data on which extraction is performed |
If you are only running this on a single node (i.e., no parallelization) then the only step is to place a copy of rfw_tsf_extractor into a location directly accessible by that node (local storage, NFS, etc.). Then follow the Workflow section below.
If you want to run this on multiple nodes, you will need to place the copy of rfw_tsf_extractor in an area accessible to all nodes. The most straightforward way to do this is to place the app on existing shared storage available to all of the nodes. For the Spring-2018 run, I set up several compute servers, installed this app on one of them, and exported it via NFS to the others. NOTE: if you follow this pattern, the app must appear in the same location on all servers, since the parallel calls include full paths to scripts and data.
Update the nodefile to contain the hostnames of the hosts that will be running the feature extraction workflow, one per line. GNU parallel uses a special ':' character to represent localhost (':' means no ssh). These nodes must be accessible via SSH for the parallel jobs to be run on them.
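For example, a nodefile might look like the following (the hostnames are placeholders; ':' adds localhost to the pool):

```
:
node1.example.org
node2.example.org
```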
Copy in labeled examples and waveform data
Place any files containing labeled examples in labeled-examples/raw. These files should be TSVs in Tom's usual format, e.g., `zone cavity cav# fault time`.
Copy over (or link) waveform data to waveform-data/. The scripts expect the usual `rf////<capture_file>` structure under waveform-data.
Setup the Python Virtual Environment
All Python code in this app was developed against stock Python 3.6.9 using the packages listed in requirements.txt. Please install a similar Python interpreter and run the following commands in a bash shell.
```bash
cd /path/to/rfw_tsf_extractor
/path/to/python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
deactivate
```
Now you have a suitable Python environment. Any script that launches Python code will load this environment before running it.
Process files containing labeled examples (Tom's results)
Run the following script to turn the "raw" label files under labeled-examples/raw into a directory of files under labeled-examples/processed, each containing a single event/label and named for the event (-.tsv).
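Presumably this refers to the processing script listed under bin/ above; the invocation would look like (a sketch, run from the top of the app):

```bash
cd /path/to/rfw_tsf_extractor
bin/process_raw_label_files.bash
```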
Note that this script performs several ancillary functions besides actually processing the raw files.
- It deletes all files under processed/ prior to processing.
- It produces labeled-examples/event_reduction.log that describes the action it took for each event.
- It writes report information to stdout and to labeled-examples/process.log.
- It writes a single file, master.csv, containing all data from the individual label files.
Update nodefile to reflect where to run feature extraction jobs. The current setup runs a job per core on all listed systems, which completely swamps the systems in nodefile. Job management is done using GNU parallel.
Run the parallel extraction scripts to extract features. On our 68-hyperthread setup, each script takes on the order of 12 hours (i.e., overnight-ish) to process 407 events. These scripts will max out the CPUs on all of the hosts listed in nodefile.
- For cavity model feature extraction, run:
- For trip model feature extraction, run:
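The exact commands are not named here; assuming the scripts follow the bin/parallel_*_extraction.bash pattern listed above, with cavity/trip filling in the glob (matching the <cavity|trip> log prefixes below), the invocations would look like this (verify the names against bin/):

```bash
bin/parallel_cavity_extraction.bash   # cavity model feature extraction
bin/parallel_trip_extraction.bash     # trip model feature extraction
```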
Review the results
- All logs are written to log/<cavity|trip>_<timestamp>, including individual tsfresh job output and the GNU parallel jobs log (<cavity|trip>_<timestamp>_jobs.log).
- The results of each tsfresh job are written to the extracted/ directory. Each event will have two files per parallel_*_extraction.bash script: a *_X.csv for the extracted features and a *_y.csv for the matching label info. Should a tsfresh job encounter an error, no CSV file will be written. Examine the GNU parallel jobs log for the exit status of each job to verify that all of the output (or lack of output) is understood.
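One way to spot-check the jobs log: GNU parallel's --joblog format puts the exit value in the 7th whitespace-separated column. The sample log below is illustrative; point awk at the real *_jobs.log instead.

```bash
# Write a fake two-job joblog for demonstration (header + one success + one failure).
cat > /tmp/sample_jobs.log <<'EOF'
Seq	Host	Starttime	JobRuntime	Send	Receive	Exitval	Signal	Command
1	node1	1525000000	43.2	0	812	0	0	bin/do_cavity_extraction.bash event1
2	node2	1525000000	41.9	0	0	1	0	bin/do_cavity_extraction.bash event2
EOF

# Print any job whose Exitval (column 7) is nonzero, skipping the header line.
awk 'NR > 1 && $7 != 0 {print "FAILED:", $0}' /tmp/sample_jobs.log
```

Every event missing from extracted/ should correspond to a nonzero Exitval line here; anything else deserves investigation.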