# Processing a folder of files with a Python script

You have a folder of files that you would like to process with a Python script. This recipe walks through placing a workload of multiple jobs at an HTCondor Access Point, where each job processes one file. It assumes that the script takes the path of the file to process as a command-line argument, and that you have already built a container with your software environment installed.

1. Copy the workload components (container, data folder, Python script) to the Access Point.

* Place the Python script (`script.py`) in the `scripts` folder.
* Place the container (`container.sif`) in the `software` folder.
* Place the data files (.csv) in the `inputs` folder. 
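
Before creating any jobs, it can help to confirm that the layout above is in place. The following is a minimal sketch, assuming the three folder names from the list; `check_layout` is a hypothetical helper written for this recipe, not an HTCondor command.

```shell
# Sketch: confirm the staged layout before creating any jobs.
# Folder names come from the list above; check_layout is a hypothetical helper.
check_layout() {
    base="${1:-.}"
    status=0
    for dir in scripts software inputs; do
        if [ ! -d "$base/$dir" ]; then
            echo "missing folder: $dir"
            status=1
        fi
    done
    # Count the .csv inputs (reports 0 if the inputs folder is absent)
    n_csv=$(find "$base/inputs" -maxdepth 1 -name '*.csv' 2>/dev/null | wc -l)
    echo "csv inputs: $((n_csv))"
    return $status
}

# Run from the directory that holds the three folders
check_layout . || echo "fix the layout before continuing"
```

The function returns a non-zero status when a folder is missing, so it can also gate later steps in a wrapper script.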

2. Create directories for the output files produced by the script and for the HTCondor log files.

In [None]:
mkdir -p results logs

3. Create a workload description using the HTCondor Workload Description Language (WDL) to process a subset of the files (the first ten, since `head` prints ten lines by default).

In [None]:
# Create the job list: one input file ID per line (file name without .csv)
ls inputs | head | sed 's/\.csv$//' > job_list.txt

In [None]:
# Create a job template that reads in the job list.
# The quoted 'EOF' keeps the shell from expanding $(ID) and the other HTCondor macros.
cat <<'EOF' > jobs.sub
INFILE          = $(ID).csv
OUTFILE         = $(ID).png
shell           = python3 script.py $(INFILE)

transfer_input_files   = scripts/script.py, inputs/$(INFILE)
transfer_output_files  = $(OUTFILE)
transfer_output_remaps = "$(OUTFILE)=results/$(OUTFILE)"

container_image = software/container.sif
request_cpus    = 1
request_memory  = 1GB
request_disk    = 2GB

log             = logs/$(CLUSTER).log
error           = logs/$(CLUSTER).$(ID).err
output          = logs/$(CLUSTER).$(ID).out

queue ID from job_list.txt
EOF
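
Because the template builds each input name as `$(ID).csv`, every line of `job_list.txt` has to be a bare ID with a matching file in `inputs`. A quick pre-submit check can catch mismatches, such as an ID that still carries its `.csv` extension. This is a sketch only; `missing_inputs` is a hypothetical helper.

```shell
# Sketch: print every ID in a job list that has no matching <input_dir>/<ID>.csv.
# missing_inputs is a hypothetical helper, not an HTCondor command.
missing_inputs() {
    list="$1"
    input_dir="$2"
    # The "|| [ -n ... ]" clause also handles a last line without a newline
    while IFS= read -r id || [ -n "$id" ]; do
        [ -n "$id" ] || continue                  # skip blank lines
        [ -f "$input_dir/$id.csv" ] || echo "$id"
    done < "$list"
}

# No output means every ID resolves to an input file
if [ -f job_list.txt ]; then
    missing_inputs job_list.txt inputs
fi
```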

4. Place the test workload.  

In [None]:
condor_submit jobs.sub

5. Once completed, review the results. 

In [None]:
# Were all of the output files created?
echo "Length of job_list: $(wc -l < job_list.txt)"
echo "Number of outputs: $(ls results | wc -l)"
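
If the two counts differ, the next question is which jobs produced no output. The sketch below compares the IDs in the job list with the `.png` files in `results` using `comm`. The helper name `report_missing_outputs` is hypothetical, and it assumes the job list holds bare IDs (no extension).

```shell
# Sketch: print the IDs from a job list that have no <ID>.png in the results folder.
# report_missing_outputs is a hypothetical helper, not an HTCondor command.
report_missing_outputs() {
    list="$1"
    result_dir="$2"
    expected=$(mktemp)
    produced=$(mktemp)
    # IDs we asked for (blank lines dropped), sorted for comm
    grep -v '^$' "$list" | sort -u > "$expected"
    # IDs we got: names of the .png files with the extension removed
    ls "$result_dir" | sed -n 's/\.png$//p' | sort -u > "$produced"
    # Lines that appear only in the expected list
    comm -23 "$expected" "$produced"
    rm -f "$expected" "$produced"
}

if [ -f job_list.txt ] && [ -d results ]; then
    report_missing_outputs job_list.txt results
fi
```

Redirecting the output to a new file gives a job list that contains only the IDs that still need to run.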

In [None]:
# how many resources were used per job? 
# TODO: insert condor command
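
One possible way to answer the resource question for completed jobs is `condor_history` with its autoformat option. This is a sketch under assumptions, not the recipe's prescribed command: the helper name `usage_report` is hypothetical, and the selection of job attributes is only a suggestion.

```shell
# Sketch: per-job requested vs. used resources for one completed cluster.
# usage_report is a hypothetical helper; the attribute list is an assumption
# about what is useful to compare, not a fixed HTCondor report.
usage_report() {
    cluster_id="$1"
    if [ -z "$cluster_id" ]; then
        echo "usage: usage_report <cluster_id>" >&2
        return 2
    fi
    if ! command -v condor_history >/dev/null 2>&1; then
        echo "condor_history not found; run this on the Access Point" >&2
        return 1
    fi
    # The cluster ID goes before -af, because -af treats what follows as attribute names.
    # -af:h prints a header row above the selected attributes.
    condor_history "$cluster_id" -af:h ProcId RequestMemory MemoryUsage \
        RequestDisk DiskUsage RemoteWallClockTime
}
```

Call it as `usage_report <cluster_id>`, using the cluster ID that `condor_submit` printed when the workload was placed. The job log written to `logs/` also records requested and used resources in each job's termination event.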

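6. Create a workload description for the entire folder. The job template `jobs.sub` is reused; only the job list changes.

In [None]:
# Rebuild the job list with every input file ID (file name without .csv)
ls inputs | sed 's/\.csv$//' > job_list.txt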

7. Place the full workload. 

In [None]:
condor_submit jobs.sub