# Pelican HTCondor Plugin

The real power of Pelican is its ability to deliver data to high throughput computing workloads.
To demonstrate this, we'll use HTCondor to do a rudimentary analysis of multiple objects from the [NOAA Global Historical Climatology Network](https://www.ncei.noaa.gov/metadata/geoportal/rest/metadata/item/gov.noaa.ncdc:C00861/html) dataset.

## About the data

From the [README](https://docs.opendata.aws/noaa-ghcn-pds/readme.html):

> GHCN-Daily is a dataset that contains daily observations over global land areas.
> It contains station-based measurements from land-based stations worldwide, about two thirds of which are for precipitation measurements only (Menne et al., 2012).
> GHCN-Daily is a composite of climate records from numerous sources that were merged together and subjected to a common suite of quality assurance reviews (Durre et al., 2010).

The GHCN dataset is available via [Amazon's Open Data](https://aws.amazon.com/opendata/) repository, at [https://noaa-ghcn-pds.s3.amazonaws.com/index.html](https://noaa-ghcn-pds.s3.amazonaws.com/index.html).
The Open Data repository is already connected to the OSDF under the namespace `aws-opendata`.
Within that namespace, the NOAA dataset is accessible under the path `us-east-1/noaa-ghcn-pds`.
Altogether, our starting Pelican URL is `osdf:///aws-opendata/us-east-1/noaa-ghcn-pds`.
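
The way the URL is assembled can be sketched in a few lines of Python. This is only an illustration: `pelican_url` is a hypothetical helper written for this tutorial, not part of the Pelican client or PelicanFS.

```python
# Pieces of the URL as described in the text above.
SCHEME = "osdf"                           # scheme used for OSDF objects
NAMESPACE = "aws-opendata"                # namespace fronting Amazon's Open Data repository
DATASET_PATH = "us-east-1/noaa-ghcn-pds"  # path of the NOAA dataset inside that namespace


def pelican_url(*parts: str) -> str:
    """Join path parts under the dataset root into an osdf:/// URL (hypothetical helper)."""
    segments = [NAMESPACE, DATASET_PATH, *parts]
    # Strip stray slashes so that "csv/" and "/by_station" still join cleanly.
    cleaned = [s.strip("/") for s in segments if s.strip("/")]
    return f"{SCHEME}:///" + "/".join(cleaned)


print(pelican_url())
print(pelican_url("ghcnd-stations.txt"))
print(pelican_url("csv", "by_station", "USW00014837.csv"))
```

The last two calls produce the same URLs that are passed to `pelican object get` in the next section.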

## Exploring the data

For this portion, we'll use the Pelican CLI to explore the data, but in principle you can use the PelicanFS Client to accomplish the same thing.

First, download a copy of the index file:

In [1]:
pelican object get osdf:///aws-opendata/us-east-1/noaa-ghcn-pds/ghcnd-stations.txt ./ghcnd-stations.txt

In [2]:
head ghcnd-stations.txt

ACW00011604  17.1167  -61.7833   10.1    ST JOHNS COOLIDGE FLD                       
ACW00011647  17.1333  -61.7833   19.2    ST JOHNS                                    
AE000041196  25.3330   55.5170   34.0    SHARJAH INTER. AIRP            GSN     41196
AEM00041194  25.2550   55.3640   10.4    DUBAI INTL                             41194
AEM00041217  24.4330   54.6510   26.8    ABU DHABI INTL                         41217
AEM00041218  24.2620   55.6090  264.9    AL AIN INTL                            41218
AF000040930  35.3170   69.0170 3366.0    NORTH-SALANG                   GSN     40930
AFM00040938  34.2100   62.2280  977.2    HERAT                                  40938
AFM00040948  34.5660   69.2120 1791.3    KABUL INTL                             40948
AFM00040990  31.5000   65.8500 1010.0    KANDAHAR AIRPORT                       40990
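
The index file is fixed-width rather than delimited, so splitting on whitespace would break station names that contain spaces. Below is a minimal parsing sketch. The column offsets are an assumption taken from the GHCN-Daily readme and agree with the listing above; the `Station` type and `parse_station` function are hypothetical names, and the state and HCN/CRN fields are skipped for brevity.

```python
from typing import NamedTuple


class Station(NamedTuple):
    station_id: str
    latitude: float
    longitude: float
    elevation_m: float
    name: str
    gsn_flag: str
    wmo_id: str


def parse_station(line: str) -> Station:
    """Parse one fixed-width record; offsets are assumed from the GHCN-Daily readme."""
    return Station(
        station_id=line[0:11].strip(),
        latitude=float(line[12:20]),
        longitude=float(line[21:30]),
        elevation_m=float(line[31:37]),
        name=line[41:71].strip(),
        gsn_flag=line[72:75].strip(),
        wmo_id=line[80:85].strip(),
    )


# Third record from the listing above, with the 30-character name padding made explicit.
sample = (
    "AE000041196  25.3330   55.5170   34.0    "
    + "SHARJAH INTER. AIRP".ljust(30)
    + " GSN     41196"
)
station = parse_station(sample)
print(station)
```

To process the real file, you could call `parse_station` on each line of `ghcnd-stations.txt`.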


In [3]:
pelican object get osdf:///aws-opendata/us-east-1/noaa-ghcn-pds/csv/by_station/USW00014837.csv ./

In [4]:
head USW00014837.csv

ID,DATE,ELEMENT,DATA_VALUE,M_FLAG,Q_FLAG,S_FLAG,OBS_TIME
USW00014837,19391001,TMAX,194,,,X,
USW00014837,19391002,TMAX,211,,,X,
USW00014837,19391003,TMAX,233,,,X,
USW00014837,19391004,TMAX,272,,,X,
USW00014837,19391005,TMAX,211,,,X,
USW00014837,19391006,TMAX,250,,,X,
USW00014837,19391007,TMAX,294,,,X,
USW00014837,19391008,TMAX,261,,,X,
USW00014837,19391009,TMAX,239,,,X,
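
According to the GHCN-Daily documentation, `DATA_VALUE` for `TMAX` and `TMIN` is stored in tenths of a degree Celsius, so `194` means 19.4 °C. The sketch below reads such rows with the standard `csv` module and converts them to Fahrenheit. It is not the code in `example.py`, which is not shown here; the function names and the choice to drop rows with a quality flag are assumptions made for illustration.

```python
import csv
import io

# First rows of USW00014837.csv as shown above.
SAMPLE_CSV = """ID,DATE,ELEMENT,DATA_VALUE,M_FLAG,Q_FLAG,S_FLAG,OBS_TIME
USW00014837,19391001,TMAX,194,,,X,
USW00014837,19391002,TMAX,211,,,X,
USW00014837,19391003,TMAX,233,,,X,
"""


def tenths_c_to_f(value: int) -> float:
    """Convert tenths of a degree Celsius to degrees Fahrenheit."""
    return (value / 10) * 9 / 5 + 32


def read_temperatures(handle):
    """Yield (date, element, degrees F) for TMAX/TMIN rows.

    Rows with a non-empty Q_FLAG are skipped (an assumption, not a rule from the tutorial).
    """
    for row in csv.DictReader(handle):
        if row["ELEMENT"] in ("TMAX", "TMIN") and not row["Q_FLAG"]:
            temp_f = round(tenths_c_to_f(int(row["DATA_VALUE"])), 2)
            yield row["DATE"], row["ELEMENT"], temp_f


rows = list(read_temperatures(io.StringIO(SAMPLE_CSV)))
for date, element, temp_f in rows:
    print(date, element, temp_f)
```

For the downloaded file, replace the `io.StringIO` object with `open("USW00014837.csv", newline="")`.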


## Rudimentary climate analysis

The included script `example.py` performs a very rudimentary analysis of the station data.
The script takes the station ID as the argument and creates a corresponding `.png` file.

In [5]:
./example.py USW00014837

ELEMENT          TMIN          TMAX
count    31176.000000  31176.000000
mean        36.415087     56.619965
std         20.041446     22.332279
min        -36.940000    -14.080000
25%         23.000000     37.040000
50%         37.040000     59.000000
75%         53.060000     77.000000
max         82.940000    122.000000


Plotting histograms of observations for 31,176 days, spanning 85.4 years 
from 1939-10-01 to 2025-02-06, to 'USW00014837.png' .



Take a look at the new `.png` file!

## Scaling out

There are a lot of stations:

In [6]:
wc -l ghcnd-stations.txt

129657 ghcnd-stations.txt


Suppose you want to run a proper climate analysis that takes 1 hour per station.

If executed serially (one after the next), this would take ~130,000 hours, or

In [7]:
python3 -c 'print(f"{(129657 / 24 / 365):.2f} years!")'

14.80 years!


The power of HTCondor and high throughput computing is the ability to distribute many individual calculations across many computers.
Users of the OSPool and similar high throughput computing systems regularly run *thousands* of jobs at a time.

With 1,000 jobs running at once, that is 1,000 stations analyzed per hour, and the "real" (wall-clock) runtime becomes

In [8]:
python3 -c 'print(f"{(129657 / 1000 / 24):.2f} days!")'

5.40 days!
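
The two estimates above can be put side by side in one short script. The one-hour cost per station and the 1,000 concurrent jobs are the assumptions stated in the text, not measurements.

```python
N_STATIONS = 129_657        # line count of ghcnd-stations.txt from above
HOURS_PER_STATION = 1       # assumed cost of one "proper" analysis
CONCURRENT_JOBS = 1_000     # assumed number of jobs running at once

# Serial: every station is processed one after the next.
serial_hours = N_STATIONS * HOURS_PER_STATION
serial_years = serial_hours / 24 / 365

# Parallel: the same total work spread evenly over the concurrent jobs.
parallel_hours = serial_hours / CONCURRENT_JOBS
parallel_days = parallel_hours / 24

print(f"serial:   {serial_hours:,} hours = {serial_years:.2f} years")
print(f"parallel: {parallel_hours:.1f} hours = {parallel_days:.2f} days")
print(f"speed-up: {serial_hours / parallel_hours:.0f}x")
```

This is an idealized estimate: it ignores time spent waiting in the queue and assumes all 1,000 slots stay busy for the whole run.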


Finally, since the data is available via the OSDF, you don't have to worry about moving the data around as part of the compute.
Just provide the necessary Pelican URLs!

## Scaling out with HTCondor and Pelican

The included `example.sub` demonstrates how to run the rudimentary climate analysis on 10 stations.

In [9]:
cat example.sub

container_image = osdf:///ospool/uc-shared/public/OSG-Staff/training/python.sif

executable   = example.py
arguments    = $(STATION_ID)

OSDF_PREFIX  = osdf:///aws-opendata/us-east-1/noaa-ghcn-pds/csv/by_station
transfer_input_files = my_functions.py, $(OSDF_PREFIX)/$(STATION_ID).csv

should_transfer_files = YES
transfer_output_remaps = "$(STATION_ID).png=results/$(STATION_ID).png"

log          = logs/example.$(Cluster).log
output       = logs/$(STATION_ID).out
error        = logs/$(STATION_ID).err

request_cpus   = 1
request_memory = 2GB
request_disk   = 2GB

queue STATION_ID from station_list.txt
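
To see what each job ends up with, it can help to expand the `$(...)` macros by hand. The sketch below mimics that substitution for two station IDs. It is a simplified stand-in for illustration only, not HTCondor's actual macro expansion, and `expand` is a hypothetical helper.

```python
import re

# Macro defined in example.sub; STATION_ID is supplied per job by `queue ... from`.
SUBMIT_MACROS = {
    "OSDF_PREFIX": "osdf:///aws-opendata/us-east-1/noaa-ghcn-pds/csv/by_station",
}
INPUT_TEMPLATE = "my_functions.py, $(OSDF_PREFIX)/$(STATION_ID).csv"
REMAP_TEMPLATE = "$(STATION_ID).png=results/$(STATION_ID).png"


def expand(template: str, macros: dict) -> str:
    """Replace each $(NAME) with its value (simplified; not HTCondor's real expansion)."""
    return re.sub(r"\$\((\w+)\)", lambda m: macros[m.group(1)], template)


for station_id in ["USW00014837", "USW00014838"]:
    job_macros = {**SUBMIT_MACROS, "STATION_ID": station_id}
    print("transfer_input_files   =", expand(INPUT_TEMPLATE, job_macros))
    print("transfer_output_remaps =", expand(REMAP_TEMPLATE, job_macros))
```

Each job therefore lists a different Pelican URL as input and sends its plot to a different file under `results/`, while the submit file itself stays the same.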



The only other thing you need is the `station_list.txt` file, which lists the station IDs you want to analyze, one per line.

You can generate this list with:

In [10]:
./generate_list.sh

In [11]:
cat station_list.txt

USW00014837
USW00014838
USW00014839
USW00014840
USW00014841
USW00014842
USW00014843
USW00014844
USW00014845
USW00014846
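
The contents of `generate_list.sh` are not shown here. As one hypothetical way to produce a list like the one above, the sketch below increments the numeric suffix of a starting station ID. It assumes the ID is a three-character prefix followed by digits, and it does not check that every generated ID actually exists in `ghcnd-stations.txt`.

```python
def consecutive_station_ids(first_id: str, count: int) -> list[str]:
    """Return `count` IDs that increment the numeric suffix of `first_id`.

    Hypothetical helper: assumes a 3-character prefix followed only by digits.
    """
    prefix, number = first_id[:3], first_id[3:]
    start = int(number)
    width = len(number)  # keep the zero padding of the original ID
    return [f"{prefix}{start + i:0{width}d}" for i in range(count)]


station_ids = consecutive_station_ids("USW00014837", 10)
print("\n".join(station_ids))
```

Redirecting the printed output into `station_list.txt` gives a file in the format that the `queue STATION_ID from station_list.txt` statement reads.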


## Submitting a list of jobs

With the set-up complete, you can submit the list of jobs as usual using HTCondor:

In [15]:
condor_submit example.sub

Submitting job(s)..........
10 job(s) submitted to cluster 1.


Then monitor the progress of the jobs with `condor_q`, or by running `condor_watch_q` in a terminal (it does not work properly in the notebook interface).

In [16]:
condor_q



-- Schedd: jovyan@jupyter-aowen4-wisc-edu---0bf5e8be : <10.129.164.237:9618?... @ 06/04/25 15:05:52
OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended 
Total for all users: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended



Once the jobs complete, you should see one `.png` file per station in the `results/` directory.
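
With more than a handful of jobs, checking the output by eye gets tedious. A minimal sketch for finding stations without a plot is shown below. It assumes the file layout used by `example.sub`; `missing_results` is a hypothetical helper, not something shipped with HTCondor or Pelican.

```python
from pathlib import Path


def missing_results(station_list: Path, results_dir: Path) -> list[str]:
    """Return the station IDs in `station_list` that have no <ID>.png in `results_dir`."""
    lines = station_list.read_text().splitlines()
    station_ids = [line.strip() for line in lines if line.strip()]
    return [sid for sid in station_ids if not (results_dir / f"{sid}.png").is_file()]


if __name__ == "__main__":
    list_path, results = Path("station_list.txt"), Path("results")
    if list_path.is_file():
        missing = missing_results(list_path, results)
        print(f"{len(missing)} station(s) without a plot:", ", ".join(missing) or "none")
    else:
        print("station_list.txt not found; run this next to example.sub")
```

For any station reported as missing, the matching `logs/<ID>.err` file is a reasonable first place to look.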

You'll be shocked to learn that winters are colder than summers (at least in Wisconsin)!