# Squaring a Number via HTCondor

This is a testing notebook for thinking about HTCondor job submission from Python.

---

## Squaring a Number

Suppose that you have been given the task of squaring a number, like `2`. You might simply do

In [1]:
2 ** 2

4

This works, of course. However, once the work becomes more complicated, or once we want to run it somewhere other than our notebook (perhaps because it takes a very long time, or needs more resources, such as memory, than we have locally), we need to think more carefully about what we're doing. We will take this simple example of squaring a number and turn it into a Jupyter Notebook-based workflow that can run on an HTCondor pool.

## Blowing up the Process

Let's rewrite the above computation in a way that exposes some of its implicit behavior.

We will define a Python function that squares numbers.

In [2]:
def square(x):
    return x ** 2

We can use it like this:

In [3]:
x = 2
y = square(x)
print(y)

4


We have explicitly separated the workflow into steps. We define the input (`x = 2`), pass it to the function (`square(x)`), and then retrieve the output from the function (`y = ...`). This separation is critical, because it lets us replace individual steps with other methods, as we'll do below.

More generally, we can think of any program as a "function" that takes "input" and returns "output". The separation we've created above is really just an example of computation in general: $y = f(x)$, for some function $f$ and input $x$. The input and output in this "square a number" example are numbers, but they could be anything: complex Python objects, text files, data files in some arcane format, etc. The important thing about this is the **structure** of the computation.

To prove this point, let's write a version of this calculation that reads its input from a file, and writes its output to another file.

## Reading and Writing Files in Python

First, let's learn how to work with files.
The Python standard library's `pathlib` module provides very convenient ways to write and read files:

In [4]:
from pathlib import Path

In [5]:
test_file = Path('test')
test_file.write_text('Hello world!')  # this writes "Hello world!" to the file
test_file.read_text()                 # this reads the text back from the file

'Hello world!'

We can store a number in a file by turning it into a string when we write it, then turning it back into an `int` when we read it out:

In [6]:
number_file = Path('number_test')
number_file.write_text(str(5))
number = int(number_file.read_text())
print(number, type(number))

5 <class 'int'>


## Squaring a Number from a File, and Writing the Result to Another File

Now that we know how to read and write files, we can write a **wrapper** function around `square` that lets it take input from a file and write output to a file. We will pass in both files as `Path` objects, as we did above.

In [7]:
def f(input_file, output_file):
    x = int(input_file.read_text())
    
    y = square(x)
    
    output_file.write_text(str(y))

Let's test that it works:

In [8]:
x = 2

input_file = Path('input')
input_file.write_text(str(x))

output_file = Path('output')

In [9]:
f(input_file, output_file)

In [10]:
y = int(output_file.read_text())
print(y)

4


We have performed the same computation that we set out to do initially, but with files as the intermediary for data transfer.
This may have seemed like an arbitrary detour, but it turns out that this is exactly how HTCondor expects us to represent our work.

Let's formalize this idea by writing helper functions for creating and reading the input and output files, respectively:

In [11]:
def create_input_file(x, input_file):
    input_file.write_text(str(x))
    
def read_output_file(output_file):
    return int(output_file.read_text())

The new workflow using the wrapper function now looks like this:

In [12]:
# the actual input
x = 2                         

# the data transfer files
input_file = Path('input')      
output_file = Path('output')

# put the input in the input file
create_input_file(x, input_file)

# run the function
f(input_file, output_file)

# get the output from the output file
y = read_output_file(output_file)
print(y)

4


Note that we named the wrapper function `f`. We do this to indicate that, in the real world, we may not know exactly what `f` does: it is a "black box". Here, we wrote `f` ourselves, but in practice `f` could be some arbitrary block of code, which could even run other programs! All we know about `f` is its **signature**: it takes two arguments, the first of which is the input file, and the second of which is the output file.
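To illustrate that only the signature matters, here is a minimal sketch of a different wrapper with the same two-argument shape. The name `count_words` is hypothetical and not part of this notebook; it reads text from the input file and writes the number of words to the output file.

```python
from pathlib import Path
import tempfile


def count_words(input_file, output_file):
    # Hypothetical stand-in for `f`: same signature, different work.
    text = input_file.read_text()
    output_file.write_text(str(len(text.split())))


# Work in a temporary directory so that no existing files are overwritten.
with tempfile.TemporaryDirectory() as tmp:
    words_in = Path(tmp) / 'input'
    words_out = Path(tmp) / 'output'
    words_in.write_text('the quick brown fox')

    count_words(words_in, words_out)
    word_count = int(words_out.read_text())

print(word_count)
```

Anything that calls `f(input_file, output_file)` could call `count_words` instead without any other change, which is exactly what makes the "black box" view useful.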

## Squaring a Number via HTCondor

HTCondor requires that
1. Our work is wrapped up in a single **function** that it can run.
1. The **inputs** to that function are provided as data encoded in a **file**.
1. The **outputs** of that function are returned as data encoded in a **file**.

These three requirements roughly correspond to the "blown up" process described above. We have a step for input, a step that runs the function, and a step for output.
We have already written the basic code for each step:
1. The wrapper function `f` is the function that HTCondor will run.
1. The input number `x` was written to a file named `input`.
1. We can read the output number `y` from a file, and convert it back to an integer.

So we're more than halfway there! We just need to know how to tell HTCondor what function to run, and where to find the input file.

## Creating an HTCondor Task

Our first step is to import some things:

In [13]:
from htcondor_job import Task, TaskState

The `Task` object represents the work that we want done. To make a `Task`, we need to give it two things: the function to run, and the input file.

In [14]:
task = Task(
    function = f,
    input_file = input_file,
)
task

Task [TaskState.Unsubmitted] f(input)

Note that the task is in the `Unsubmitted` state. It also tells us what function it will run (`f`) and the location of the input file (`input`).

We have not yet told HTCondor to actually run the task. To do so, we `submit` the task. HTCondor will then schedule it for execution.

In [15]:
task.submit()

Task [TaskState.Unsubmitted] f(input)

The state of a `Task` is available through the attribute `Task.state`. This attribute will be updated in the background for you.

In [16]:
possible_states = "\n  ".join(str(t) for t in TaskState)
print(f'The possible task states are:\n  {possible_states}\n')
print(f'The current state of task is {task.state}')

The possible task states are:
  TaskState.Unsubmitted
  TaskState.Idle
  TaskState.Running
  TaskState.Submitted
  TaskState.Held
  TaskState.Completed
  TaskState.Removed

The current state of task is TaskState.Unsubmitted


Wait for completion:

In [17]:
import time

while task.state is not TaskState.Completed:
    print(task.state)
    time.sleep(1)
print(task.state)   # print out the final state

TaskState.Unsubmitted
TaskState.Idle
TaskState.Idle
TaskState.Idle
TaskState.Running
TaskState.Completed
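The loop above only exits when the state becomes `Completed`, so it would spin forever if the task ended up `Held` or `Removed`. A more defensive polling loop might look like the sketch below. The helper `wait_for_state` and the `FakeState` enum are hypothetical names introduced here for illustration; `FakeState` stands in for `TaskState` so that the example runs without HTCondor.

```python
import enum
import time


class FakeState(enum.Enum):
    # Stand-in for TaskState so that this sketch runs without HTCondor.
    Idle = 'Idle'
    Running = 'Running'
    Held = 'Held'
    Completed = 'Completed'


def wait_for_state(get_state, done, failed, timeout=60.0, poll=1.0):
    """Poll get_state() until it returns `done`; raise on failure or timeout."""
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state == done:
            return state
        if state in failed:
            raise RuntimeError(f'task stopped in state {state}')
        if time.monotonic() > deadline:
            raise TimeoutError(f'task still in state {state} after {timeout} s')
        time.sleep(poll)


# Simulate a task that is idle, then running, then completed.
states = iter([FakeState.Idle, FakeState.Running, FakeState.Completed])
final_state = wait_for_state(
    lambda: next(states),
    done=FakeState.Completed,
    failed={FakeState.Held},
    poll=0.01,
)
print(final_state)
```

With the real task, the call might look like `wait_for_state(lambda: task.state, TaskState.Completed, {TaskState.Held, TaskState.Removed})`, assuming `task.state` behaves as described above.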


Read the task's output file:

In [18]:
y = int(task.output_file.read_text())
print(y)

4


## Putting it All Together

Our original workflow, once we had separated out the individual steps, looked like this:

In [19]:
x = 2          # define input
y = square(x)  # pass input to function; run function; get output
print(y)

4


If we put the HTCondor-powered workflow all together in one cell, it looks like this:

In [21]:
# define input
x = 2

# the data transfer files
input_file = Path('input')      
output_file = Path('output')

# put the input in the input file
create_input_file(x, input_file)

# pass input to function
task = Task(
    function = f,
    input_file = input_file,
)

# actually run the function
task.submit()

# wait for completion
while task.state is not TaskState.Completed:
    time.sleep(1)
    
# get output from the task's output file
y = read_output_file(task.output_file)
print(y)

4


The same steps are all present!
They just look a little different, because we wanted to run the function via HTCondor.

## Adapting to your Work

What we have accomplished above is taking a **specific** task, like squaring a number, and building an **abstract workflow** around it. This abstract workflow is fundamental: any computation can be performed inside `f`, as long as it can be expressed as a flow of data from an `input_file` to an `output_file`. We wrote three functions to accomplish this:

* `f`
* `create_input_file`
* `read_output_file`

To run your own work in this framework, you just need to define these three functions for whatever your specific problem is.
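Before involving HTCondor, it can help to check that your three functions fit together by running them in-process. The sketch below does this with a hypothetical helper named `run_locally`; the three workflow functions are copied from the cells above so that the block runs on its own.

```python
from pathlib import Path
import tempfile


def run_locally(x, f, create_input_file, read_output_file, workdir):
    """Run the three-function workflow in-process, without HTCondor."""
    input_file = Path(workdir) / 'input'
    output_file = Path(workdir) / 'output'
    create_input_file(x, input_file)
    f(input_file, output_file)
    return read_output_file(output_file)


# The three functions from this notebook, repeated here for completeness.
def f(input_file, output_file):
    x = int(input_file.read_text())
    output_file.write_text(str(x ** 2))


def create_input_file(x, input_file):
    input_file.write_text(str(x))


def read_output_file(output_file):
    return int(output_file.read_text())


# Use a temporary directory so that existing files are left alone.
with tempfile.TemporaryDirectory() as tmp:
    result = run_locally(3, f, create_input_file, read_output_file, tmp)

print(result)
```

If the local run gives the answer you expect, the same three functions should be ready to hand to a `Task`.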

### Writing `f`

The internals of `f` may be unknown; often, it is an external program. You can use the `subprocess` module from Python's standard library to run external programs.
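As a minimal sketch of this idea, the version of `f` below hands the actual computation to an external process. The "external program" here is only a stand-in (a second Python interpreter running a one-line script); in practice you would replace the command list with whatever program you need to run.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def f(input_file, output_file):
    x = input_file.read_text().strip()

    # Stand-in external program: a Python one-liner that squares its argument.
    result = subprocess.run(
        [sys.executable, '-c', 'import sys; print(int(sys.argv[1]) ** 2)', x],
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode them as str instead of bytes
        check=True,           # raise CalledProcessError on a non-zero exit code
    )

    output_file.write_text(result.stdout.strip())


# Try it out in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    input_file = Path(tmp) / 'input'
    output_file = Path(tmp) / 'output'
    input_file.write_text('7')

    f(input_file, output_file)
    squared = int(output_file.read_text())

print(squared)
```

The signature of `f` is unchanged, so the rest of the workflow does not need to know that an external program is involved.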

### Writing `create_input_file` and `read_output_file`

Again, the details of these functions depend on the behavior of `f`. Python's standard library is good at writing and reading plain text, JSON, CSV, and other simple, general-use file formats. More domain-specific formats can be read via third-party libraries. Google is your friend here!
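For example, if your inputs and outputs are more structured than a single integer, JSON is often a convenient choice. The sketch below assumes a hypothetical `f` that squares a list of numbers; the dictionary keys `numbers` and `squares` are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def create_input_file(params, input_file):
    # Serialize a dictionary of parameters as JSON text.
    input_file.write_text(json.dumps(params))


def read_output_file(output_file):
    # Parse the JSON text back into Python objects.
    return json.loads(output_file.read_text())


def f(input_file, output_file):
    # Hypothetical work: square every number in the input.
    params = json.loads(input_file.read_text())
    results = {'squares': [x ** 2 for x in params['numbers']]}
    output_file.write_text(json.dumps(results))


with tempfile.TemporaryDirectory() as tmp:
    input_file = Path(tmp) / 'input.json'
    output_file = Path(tmp) / 'output.json'

    create_input_file({'numbers': [1, 2, 3]}, input_file)
    f(input_file, output_file)
    results = read_output_file(output_file)

print(results)
```

Only the two helper functions and `f` had to change; the overall shape of the workflow is the same as before.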