# 2024/07/16 PRISM File Usage and Size
_Author: Meaghan Freund_

PRISM (Parameter-elevation Regressions on Independent Slopes Model) is an Oregon state website, which supplies daily, monthly, and annual files of observed climate values and attributes as raster files. These files are formatted as zip files, in which contains a lot of information considering that the geography of the raster datasets is all of the United States, excluding Hawaii and Alaska. 

Considering a daily view of these values, one zip file (which contains 8 files in each instance) would have to be downloaded for a singular day, which could take up loads of space and time. Before making a proper ADRIO template for utilizing the PRISM data, 
some testing is required to determine the amount of space for a range of days the data takes up and the time it takes to run.

For more information concerning PRISM, visit the PRISM website homepage: _https://prism.oregonstate.edu/_

## Fetching Data

The following script is designed to fetch the urls of the zip files that are needed for a specified number of days. The urls of PRISM files have the same file name format, but change based on the attributes and the dates, which the fetch_raster accounts for with the given parameters. These functions do not incorporate the caching system that the ADRIO template for PRISM would have, since the current issue is actively downloading the zip files for PRISM.

In [1]:
import os
import zipfile
from datetime import date as datetype
from datetime import timedelta
from pathlib import Path

import requests
from dateutil.relativedelta import relativedelta

from epymorph.data_shape import Shapes
from epymorph.error import DataResourceException
from epymorph.simulation import AttributeDef, TimeFrame

# abbreviations for file urls
attrib_vars = {
    "name": [""],
    "precipitation": ["ppt"],
    "mean_temperature": ["tmean"],
    "min_temperature": ["tmin"],
    "max_temperature": ["tmax"],
    "mean_dew_point_temp": ["tdmean"],
    "min_vpd": ["vpdmin"],
    "max_vpd": ["vpdmax"],
}


def download_bil_file(url):
    """Opens the given url of a zip file and outputs the file path of the bil file"""
    # set up directory and zip file
    local_zip_file = Path(url).name
    extract_dir = local_zip_file.replace(".zip", "")

    # open the url and open the zip
    response = requests.get(url, timeout=10)
    with Path(local_zip_file).open("wb") as file:
        file.write(response.content)

    with zipfile.ZipFile(local_zip_file, "r") as zip_ref:
        zip_ref.extractall(extract_dir)

    bil_file_path = None

    # search for the bil file inside
    for root, dirs, files in os.walk(extract_dir):
        for file in files:
            if file.endswith(".bil"):
                bil_file_path = Path(root) / file
                break
    return bil_file_path


def fetch_raster(attribute: AttributeDef, date_range: TimeFrame):
    """Fetch the raster files from the PRISM index"""
    # set some date variables with the date_range
    latest_date = datetype.today() - timedelta(days=1)
    first_day = date_range.start_date
    last_day = date_range.end_date

    # PRISM only accounts for after 1981 up to the previous day to "yesterday"
    if first_day.year < 1981 or last_day > latest_date:
        msg = (
            "Given date range is out of range, please enter dates between "
            f"January 1st 1981 and {latest_date}"
        )
        raise DataResourceException(msg)

    # create the list of days in date_range
    date_list = [
        first_day + timedelta(days=x) for x in range((last_day - first_day).days + 1)
    ]
    url_list = []
    bil_file_paths = []

    # the stability of PRISM data is defined by date, specified around the 6 month mark
    six_months_ago = datetype.today() + relativedelta(months=-6)
    last_completed_month = six_months_ago.replace(day=1) - timedelta(days=1)

    for single_date in date_list:
        # if it is within the current month
        if (
            single_date.year == latest_date.year
            and single_date.month == latest_date.month
        ):
            stability = "early"

        # if it is before the last finished month
        elif single_date >= last_completed_month:
            stability = "provisional"

        # if it is older than 6 completed months
        else:
            stability = "stable"

        # format the date for the urls
        formatted_date = single_date.strftime("%Y%m%d")
        year = single_date.year

        # get the abbreviation for the variable
        for var in attrib_vars.keys():
            if str(attribute.name).startswith(var):
                attribute_name = attrib_vars[var][0]

        url = f"https://ftp.prism.oregonstate.edu/daily/{attribute_name}/{year}/PRISM_{attribute_name}_{stability}_4kmD2_{formatted_date}_bil.zip"
        url_list.append(url)

    # output all of the urls of the bil files
    bil_file_paths = [download_bil_file(url) for url in url_list]

    return bil_file_paths

In [2]:
# set example attributes
start_date = datetype(2023, 3, 1)
days = 3
end_date = start_date + timedelta(days=days)
date_range = TimeFrame.range(start_date, end_date)
attribute = AttributeDef("precipitation", float, Shapes.NxN)

files = fetch_raster(attribute, date_range)
print("Fetched files from PRISM: ")
for file in files:
    print(f"{file}\n")

Fetched files from PRISM: 
PRISM_ppt_stable_4kmD2_20230301_bil/PRISM_ppt_stable_4kmD2_20230301_bil.bil

PRISM_ppt_stable_4kmD2_20230302_bil/PRISM_ppt_stable_4kmD2_20230302_bil.bil

PRISM_ppt_stable_4kmD2_20230303_bil/PRISM_ppt_stable_4kmD2_20230303_bil.bil

PRISM_ppt_stable_4kmD2_20230304_bil/PRISM_ppt_stable_4kmD2_20230304_bil.bil



## Retrieving the Data via Centroids

For reading raster files, it is uncertain on how the data for granularities would be read. Ideally for an ADRIO template for PRISM, we would start with multiple strategies and narrow down the most accurate representation of locations. However for simplicity in testing, I have implemented the manner of fetching the coordinates of centroids of a given location. This given example is using the county granularity and calculating the centroid from there.

The output of these functions are matrices, read with the columns being the dates, the enclosed rows being each county, and the intersection being the value for that county on that day. This example shows the amount of precipitation (in mm) in two Ohio counties in July 1st-4th, 2023.

In [3]:
from datetime import date as datetype

import numpy as np
import rasterio
from numpy.typing import NDArray

from epymorph.data_shape import Shapes
from epymorph.simulation import AttributeDef, TimeFrame


def raster_values_at_centroids(
    attribute: AttributeDef, date_range: TimeFrame, centroids: NDArray
) -> NDArray[np.float64]:
    """
    Retrieves the raster value at a centroid of a geoid.
    """
    raster_paths = fetch_raster(attribute, date_range)
    results = []

    # read in each file
    for raster_file in raster_paths:
        raster_path = Path(raster_file)
        with rasterio.open(raster_path) as src:
            # retrieve the coordinates from centroids
            coords = [
                (x, y) for x, y in zip(centroids["longitude"], centroids["latitude"])
            ]
            # round and save the raster values
            values = [round(x[0], 3) for x in src.sample(coords)]

        results.append(values)

    # create numpy array
    climate_vals = np.array(results)

    return climate_vals

In [4]:
from epymorph.data_type import CentroidDType

counties = ["39001", "39083"]  # Adams County and Knox County
centroids = np.array(
    [(-83.47214942, 38.84550293), (-82.42153605, 40.39876741)], dtype=CentroidDType
)

# set example attributes
start_date = datetype(2023, 3, 1)
end_date = datetype(2023, 3, 4)
date_range = TimeFrame.range(start_date, end_date)
start_date_str = start_date.strftime("%m/%d/%Y")
end_date_str = end_date.strftime("%m/%d/%Y")
attribute = AttributeDef("precipitation", float, Shapes.NxN)

# call function and print
raster_values_array = raster_values_at_centroids(attribute, date_range, centroids)

print(f"Raster values at county centroids {counties}:\n")
print(f"On dates: {start_date_str} - {end_date_str}")
print(raster_values_array)

Raster values at county centroids ['39001', '39083']:

On dates: 03/01/2023 - 03/04/2023
[[ 0.     0.   ]
 [ 3.676  0.   ]
 [ 3.05   0.   ]
 [19.941 37.54 ]]


## Monthly Data Experimentation

The above were basic examples, but what about for multiple locations and multiple days? Each date and attribute is a singular zip file, which means there would be 30 zip files (240 files total) downloaded if a user wanted the scope of a month. Below will test for a single month in 2024, in all counties in the state of Arizona, with two climate variables: precipitation and maximum temperature, which are measured in millimeters and degrees Celsius respectively.

### June 2024 Precipitation

**Setup**

In [5]:
start_date = datetype(2024, 6, 1)
end_date = datetype(2024, 6, 30)
TimeFrame.duration_days = 30
date_range = TimeFrame.range(start_date, end_date)

# list of all of the counties in Arizona
all_counties_az = [
    "04001",
    "04003",
    "04005",
    "04007",
    "04009",
    "04011",
    "04012",
    "04013",
    "04015",
    "04017",
    "04019",
    "04021",
    "04023",
    "04025",
    "04027",
]

# manual centroids for all counties in Arizona
centroids = np.array(
    [
        (-109.48884962, 35.3955288),
        (-109.75126314, 31.87963709),
        (-111.77052096, 35.83872483),
        (-110.81163686, 33.79970237),
        (-109.88745163, 32.9326627),
        (-109.24035541, 33.21540167),
        (-113.98157752, 33.72938684),
        (-112.49151144, 33.34903944),
        (-113.75790301, 35.70406832),
        (-110.32141935, 35.39955034),
        (-111.7898635, 32.09739903),
        (-111.3447399, 32.90436651),
        (-110.84651691, 31.52596126),
        (-112.55373567, 34.59984444),
        (-113.9056188, 32.76961884),
    ],
    dtype=CentroidDType,
)


#### Precipitation

**Getting Centroid Raster Values**

In [6]:
attribute = AttributeDef("precipitation", float, Shapes.NxN)

raster_values_array = raster_values_at_centroids(attribute, date_range, centroids)

print("Raster values in all counties in Arizona:")
print(raster_values_array)

start_date_str = start_date.strftime("%m/%d/%Y")
end_date_str = end_date.strftime("%m/%d/%Y")
print(f"\nDates: {start_date_str} - {end_date_str}")
print(f"\nCounties: {all_counties_az}")

Raster values in all counties in Arizona:
[[0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00]
 [0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00]
 [0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00]
 [0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00]
 [0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000e+00]
 [0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
  0.0000e+00 0.0000e+00 0.0000

**Observation:**

Arizona in general does not accumulate as much rain as other states in the east, so an output of low numbers for Arizona is to be expected. However, the opposite is anticipated for the maximum temperature, since Arizona is much warmer, especially in the summer.

### June 2024 Maximum Temperature

**Setup**

In [7]:
attribute = AttributeDef("max_temperature", float, Shapes.NxN)

**Getting Centroid Raster Values**

In [8]:
raster_values_array = raster_values_at_centroids(attribute, date_range, centroids)

print("Raster values in all counties in Arizona:")
print(raster_values_array)
print(f"\nDates: {start_date_str} - {end_date_str}")
print(f"\nCounties: {all_counties_az}")

Raster values in all counties in Arizona:
[[30.884 34.701 29.102 35.193 38.369 33.943 39.124 41.115 32.026 31.787
  36.58  39.248 34.635 28.623 40.465]
 [30.815 35.276 28.102 34.82  38.712 34.123 37.832 40.148 30.554 31.478
  36.292 39.252 35.138 27.446 39.924]
 [28.947 33.646 26.424 32.92  37.43  32.722 36.984 39.84  29.408 29.771
  35.303 38.266 33.723 26.545 39.45 ]
 [29.668 33.986 27.909 33.302 37.011 32.932 37.959 39.606 30.93  30.894
  34.831 37.689 33.046 27.478 38.922]
 [30.909 34.1   28.998 34.19  37.187 33.32  38.878 39.598 32.549 32.366
  35.653 37.534 34.108 29.691 38.939]
 [33.342 36.918 32.011 37.588 41.239 37.185 41.057 42.637 35.419 34.711
  38.664 40.332 37.039 32.263 40.448]
 [35.655 38.506 34.141 39.581 42.484 38.76  42.81  44.054 37.045 37.096
  39.943 41.985 37.749 33.875 42.072]
 [34.646 37.408 32.782 38.431 41.437 36.962 41.798 43.519 34.895 35.868
  39.881 42.844 36.999 31.118 42.492]
 [33.05  35.926 30.271 36.319 38.683 35.458 40.051 43.128 33.647 33.636
  37.3

## Space Taken Up from PRISM Files

The majority of the time spent running comes from fetching the files themselves rather than actually interpreting the raster data. However, the space has yet to be recorded. The following code takes in a basic file path and collects the amount of space all of the dates for those files takes up all together.

In [9]:
from datetime import date as datetype
from datetime import timedelta
from typing import List


def generate_file_paths(
    template: str, start_date: datetype, num_days: int
) -> List[str]:
    """Generate the file names"""
    file_paths = []
    # iterate through everyday
    for i in range(num_days):
        current_date = start_date + timedelta(days=i)
        # format each date
        formatted_date = current_date.strftime("%Y%m%d")
        file_path = template.replace("DATE", formatted_date)
        file_paths.append(file_path)
    return file_paths


def format_size(size_in_bytes: int) -> str:
    """Format the file size as a string"""
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        # format based on the amount
        if size_in_bytes < 1024:
            return f"{size_in_bytes:.2f} {unit}"
        size_in_bytes /= 1024


def print_file_size(file_paths: List[str]):
    """Print the sum of all of the files sizes"""
    file_sum = 0
    # add up all of the file sizes
    for f in file_paths:
        file_path = Path(f)
        file_size_bytes = file_path.stat().st_size
        file_sum += file_size_bytes

    formatted_size = format_size(file_sum)
    print(f"PRISM file size: {formatted_size}")

### Precipitation Files

In [10]:
template = "PRISM_ppt_provisional_4kmD2_DATE_bil.zip"
file_paths = generate_file_paths(template, start_date, TimeFrame.duration_days)

print_file_size(file_paths)

PRISM file size: 34.09 MB


### Maximum Temperature Files

In [11]:
template = "PRISM_tmax_provisional_4kmD2_DATE_bil.zip"
file_paths = generate_file_paths(template, start_date, TimeFrame.duration_days)

print_file_size(file_paths)

PRISM file size: 56.93 MB


## Size Analysis

As shown by the file sizes for the precipitation against the maximum temperature files, the difference is quite large. The maximum temperature files take over 20 more MB of storage than the precipitation files. The likely reason is that maximum temperature has raster values for every single point on the raster grid, as each location has a maximum temperature of some value. Compare this with precipitation in which some locations do not receive any precipitation, leaving areas with 0s. The amount of space taken when fetching and downloading files is pretty significant for the range of a month. As for time, through this experimentation, it has been discovered that weak internet can drastically change the runtime for fetching the zip files from the PRISM website. In addition, having a large amount of files cached or having low storage space can also increase the time for downloading the files. In general, when running the ADRIO template for PRISM, ensure that the given machine has the storage to hold this amount of data and that the provided internet is decently strong.