In [1]:
import os
from pathlib import Path
import pandas as pd

Set the `SYSCAPS` environment variable to the path where the datasets are stored. The data should live in a directory called `BuildingsBench`. For example, on my system `SYSCAPS` is set to `/projects/foundation/v2.0.0/BuildingsBench`.

Many of the Python scripts in this project use the `SYSCAPS` environment variable to locate data and metadata.

In [2]:
os.environ['SYSCAPS'] = '/projects/cascde/pemami/v2.0.0/BuildingsBench/'

In [6]:
# The dataset directory 
SYSCAPS_PATH = os.environ.get('SYSCAPS', '')
if SYSCAPS_PATH == '':
    raise ValueError('SYSCAPS environment variable not set')
SYSCAPS_PATH = Path(SYSCAPS_PATH)
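
Before loading anything, it can help to confirm that the directory contains the sub-directories this notebook reads from. The following is a minimal sketch, not part of the project's code: `missing_subdirs` is a hypothetical helper, and the three names it checks are simply the ones used in the cells below. The demonstration runs on a throwaway directory so it works without the dataset.

```python
import tempfile
from pathlib import Path

# Sub-directories referenced later in this notebook.
EXPECTED_SUBDIRS = ('captions', 'Buildings-900K', 'metadata')

def missing_subdirs(root, expected=EXPECTED_SUBDIRS):
    """Return the names in `expected` that are not directories under `root`.

    Hypothetical helper for illustration; not part of the project.
    """
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]

# Demonstration on a temporary directory that only contains `captions`.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'captions').mkdir()
    demo_missing = missing_subdirs(tmp)
    print(demo_missing)
```

In the notebook itself, calling `missing_subdirs(SYSCAPS_PATH)` should return an empty list if the layout matches what the later cells assume.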

The Buildings dataset stores the LLM-generated natural-language SysCaps captions in a single CSV file called `captions.csv`.

In [4]:
# `captions` is the name of the subdirectory containing the captions
# `comstock` (Commercial Building stock) is the detailed name of the Buildings dataset
# `medium` is the caption length (short, medium, long)
df = pd.read_csv(SYSCAPS_PATH / 'captions' / 'comstock' / 'medium' / 'captions.csv', index_col=None, header=0)
df

Unnamed: 0,building_id,caption
0,1,The FullServiceRestaurant is a single-story b...
1,2,This Full Service Restaurant is a single-stor...
2,3,This strip mall restaurant is a single-story ...
3,4,Sure! Here's a building description based on ...
4,5,The retail standalone building is a single-st...
...,...,...
347169,349996,The primary school is a single-story building...
347170,349997,"Sure, here's a building description based on ..."
347171,349998,This three-story outpatient medical building ...
347172,349999,The building in question is a single-story st...
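
The table above has one caption per `building_id`, plus an `Unnamed: 0` column that appears to be a row index saved into the CSV. A small sketch of how one might look up a caption by building id is shown here on a toy DataFrame with the same three columns; the caption strings are placeholders, not real dataset content.

```python
import pandas as pd

# Toy stand-in for captions.csv with the same three columns as the output above.
toy = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'building_id': [1, 2, 3],
    'caption': [
        'placeholder caption for building 1',
        'placeholder caption for building 2',
        'placeholder caption for building 3',
    ],
})

# Drop the leftover row-number column and index the captions by building id.
captions = toy.drop(columns=['Unnamed: 0']).set_index('building_id')['caption']

example_caption = captions.loc[2]
print(example_caption)
```

Applied to the real `df`, the same two calls would give a Series from which `captions.loc[building_id]` returns the caption text.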


The attributes used to create these captions are stored in Parquet files under the `Buildings-900K` dataset sub-directory:

In [10]:
buildings_bench_path = SYSCAPS_PATH / 'Buildings-900K' / 'end-use-load-profiles-for-us-building-stock' / '2021'
# load the metadata of the two ComStock releases (AMY2018 and TMY3 weather)
df1 = pd.read_parquet(buildings_bench_path / 'comstock_amy2018_release_1' / 'metadata' / 'metadata.parquet', engine="pyarrow")
df2 = pd.read_parquet(buildings_bench_path / 'comstock_tmy3_release_1' / 'metadata' / 'metadata.parquet', engine="pyarrow")
# keep only the rows of df1 whose index values also appear in df2
df = df1.loc[df1.index.intersection(df2.index).values]
df.index

Index([], dtype='int64')

In [11]:
df1

In [14]:
# the attribute parquet files are indexed by building id
df.loc[100000].to_dict()

{'applicability': True,
 'in.upgrade_name': 'Baseline',
 'in.tstat_clg_delta_f': 0.0,
 'in.tstat_clg_sp_f': 999.0,
 'in.tstat_htg_delta_f': 7.0,
 'in.tstat_htg_sp_f': 999.0,
 'in.aspect_ratio': 4.0,
 'in.building_subtype': None,
 'in.county': 'G2101110',
 'in.building_type': 'Warehouse',
 'in.rotation': 0.0,
 'in.number_of_stories': 1.0,
 'in.sqft': 350000.0,
 'in.hvac_system_type': 'Gas unit heaters',
 'in.weekday_operating_hours': 18.75,
 'in.weekday_opening_time': 4.0,
 'in.weekend_operating_hours': 12.0,
 'in.weekend_opening_time': 9.25,
 'in.energy_code_followed_during_last_exterior_lighting_replaceme': 'ComStock 90.1-2007',
 'in.energy_code_followed_during_last_hvac_replacement': 'ComStock 90.1-2010',
 'in.energy_code_followed_during_last_interior_equipment_replacem': 'ComStock 90.1-2010',
 'in.energy_code_followed_during_last_interior_lighting_replaceme': 'ComStock 90.1-2019',
 'in.energy_code_followed_during_last_roof_replacement': 'ComStock DOE Ref 1980-2004',
 'in.energy_code

We only use a subset of these attributes. The filtered list of attribute names for commercial buildings used in our work is stored in a file called `attributes_comstock.txt`.

In [9]:
attributes_path = SYSCAPS_PATH / 'metadata' / 'syscaps' / 'energyplus' / 'attributes_comstock.txt'
# read_text() opens and closes the file for us
attributes = attributes_path.read_text().strip().split('\n')
# remove surrounding quotation marks, then drop empty strings
attributes = [x.strip('"') for x in attributes]
attributes = list(filter(None, attributes))
attributes

['in.building_subtype',
 'in.building_type',
 'in.number_of_stories',
 'in.sqft',
 'in.hvac_system_type',
 'in.weekday_operating_hours',
 'in.weekday_opening_time',
 'in.weekend_operating_hours',
 'in.weekend_opening_time',
 'in.tstat_clg_delta_f',
 'in.tstat_clg_sp_f',
 'in.tstat_htg_delta_f',
 'in.tstat_htg_sp_f']
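
With the attribute list in hand, the metadata row for a building can be reduced to just those columns. The sketch below illustrates the selection on a one-row toy table; its values are copied from the dictionary printed earlier for building `100000`, and `toy_attributes` is a shortened, illustrative version of the list above.

```python
import pandas as pd

# Toy stand-in for the metadata table, indexed by building id.
# The values are taken from the dictionary shown earlier for building 100000.
toy_metadata = pd.DataFrame(
    {
        'in.county': ['G2101110'],
        'in.building_type': ['Warehouse'],
        'in.rotation': [0.0],
        'in.sqft': [350000.0],
    },
    index=[100000],
)

# Shortened attribute list for illustration only.
toy_attributes = ['in.building_type', 'in.sqft']

# Select one building (row label) and only the attributes of interest (column labels).
selected = toy_metadata.loc[100000, toy_attributes].to_dict()
print(selected)
```

On the real data the equivalent call would be `df.loc[100000, attributes].to_dict()`, assuming every name in `attributes` is a column of `df`.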

We can repeat this to explore the Wind dataset, where each caption is stored in its own text file.

In [21]:
!ls -l $SYSCAPS/captions/wind/medium/aug_0 | head -n 5

total 2000
-rw-rwxr-- 1 pemami foundation 357 Feb 19  2024 Layout000_cap.txt
-rw-rwxr-- 1 pemami foundation 372 Feb 19  2024 Layout001_cap.txt
-rw-rwxr-- 1 pemami foundation 548 Feb 19  2024 Layout002_cap.txt
-rw-rwxr-- 1 pemami foundation 548 Feb 19  2024 Layout003_cap.txt
ls: write error: Broken pipe
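
The `Broken pipe` message is harmless: `head` exits after printing five lines while `ls` is still writing. If the shell is not available, a similar listing can be sketched in Python. `first_caption_files` is a hypothetical helper, and the `Layout*_cap.txt` pattern is inferred from the file names in the listing above; the demonstration uses a temporary directory with placeholder files.

```python
import tempfile
from pathlib import Path

def first_caption_files(caption_dir, n=4):
    """Return the first `n` caption files in `caption_dir`, sorted by name.

    Hypothetical helper for illustration; not part of the project.
    """
    return sorted(Path(caption_dir).glob('Layout*_cap.txt'))[:n]

# Demonstration on a temporary directory with placeholder caption files.
with tempfile.TemporaryDirectory() as tmp:
    for idx in range(6):
        (Path(tmp) / f'Layout{idx:03}_cap.txt').write_text(f'placeholder caption {idx}')
    demo_names = [p.name for p in first_caption_files(tmp)]
    print(demo_names)
```

In the notebook, the directory to pass would be `SYSCAPS_PATH / 'captions' / 'wind' / 'medium' / 'aug_0'`.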


In [24]:
# `captions` is the name of the subdirectory containing the captions
# `wind` is the name of the Wind dataset (FLORIS is the simulator name)
#  `medium` is the caption length
#  `aug_0` is the style augmentation type
for idx in range(4):
    with open(SYSCAPS_PATH / 'captions' / 'wind' / 'medium' / 'aug_0' / f'Layout{idx:03}_cap.txt') as f:
        print( f.read() )
        print()

 Nestled within a picturesque landscape, this wind plant boasts a cluster layout, housing a total of 101 turbines that stand tall at an impressive 130 meters in diameter. With a deliberate spacing of five times the rotor diameter, the turbines are strategically positioned to maximize energy production, resulting in a combined rated power of 3.4 megawatts.

 This wind plant features a single-string configuration with 30 turbines, each boasting a rotor diameter of 130 meters. The turbines are spaced at an average distance of four times the rotor diameter, resulting in a highly efficient energy capture. Each turbine is capable of generating 3.4 megawatts of power, making it a significant contributor to the local energy grid.

 Sure, here's a description of the wind plant based on the provided attributes:
Located in a vast, open area, this wind plant is arranged in a cluster formation, with 188 turbines standing tall and proud, each with a rotor diameter of 130 meters. The turbines are spa