# Dataset Generation Part 4: Assembling & Labeling the Data

In this part, we'll finalize the dataset. We'll proceed by:

1. Gathering the EnergyPlus results into a URBANopt-like `results.json`
2. Using the OpenAI API (GPT-4) to label the data, not with 1s or 0s but with recommendation "levels" between 0 and 1, one per retrofit:

$$\left[ \text{insulation}, \text{windows}, \text{HVAC}, \text{seal} \right]$$
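
As a small illustration of how such a label relates to the vector above, the sketch below converts a label object into an ordered list. The key names are taken from the response format requested in the labeling prompt later in this part; the helper name `label_to_vector` and the fixed ordering are assumptions made for this example only.

```python
# Order of the label vector: [insulation, windows, HVAC, seal].
# Key names follow the JSON format requested from the model later on.
LABEL_KEYS = ["insulation", "windows", "hvac", "sealing"]


def label_to_vector(label: dict) -> list[float]:
    """Hypothetical helper: turn a label dict into an ordered list of floats."""
    return [float(label[key]) for key in LABEL_KEYS]


# Example values copied from the response format shown in the prompt
example = {"insulation": 0.5, "windows": 0.0, "hvac": 1.0, "sealing": 0.25}
vector = label_to_vector(example)
print(vector)
```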

In [1]:
%pip install openai pandas python-dotenv

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import json
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

client = OpenAI()

## 1. Compile `results.json`

In [3]:
def extract_results_from_csv(home_dir: str) -> dict:
    """
    Extracts summary statistics and hourly zone temperatures from an EnergyPlus CSV output.

    Parameters:
        home_dir (str): Name of the home (e.g., RAMBEAU_RD_15)

    Returns:
        dict: Structured dictionary of zone-level features.
    """
    df = pd.read_csv(f"dataset/{home_dir}/simulation_output/eplusout.csv")

    # Normalize column names: EnergyPlus headers can carry stray whitespace
    df.columns = df.columns.str.strip()
    mean_air_col = "MAINZONE:Zone Mean Air Temperature [C](Hourly)"
    air_col = "MAINZONE:Zone Air Temperature [C](Hourly)"
    
    def compute_stats(series):
        return {
            "average": round(series.mean(), 3),
            "min": round(series.min(), 3),
            "max": round(series.max(), 3),
            "hourly": [round(x, 3) for x in series.tolist()]
        }

    return {
        "zone": "MAINZONE",
        "features": {
            "mean_air_temperature": compute_stats(df[mean_air_col]),
            "air_temperature": compute_stats(df[air_col])
        }
    }
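
To see the shape of one feature entry without running EnergyPlus, the same statistics can be computed on a small hand-made series. This is only a sketch: the three temperatures below are made up, and `compute_stats` is repeated here as a standalone function so that the cell runs on its own.

```python
import pandas as pd


def compute_stats(series: pd.Series) -> dict:
    """Standalone copy of the per-column statistics used above."""
    return {
        "average": round(float(series.mean()), 3),
        "min": round(float(series.min()), 3),
        "max": round(float(series.max()), 3),
        "hourly": [round(x, 3) for x in series.tolist()],
    }


# Made-up temperatures in degrees C, standing in for one EnergyPlus column
temps = pd.Series([20.0, 21.5, 23.0])
stats = compute_stats(temps)
print(stats)
```

A real run produces one value per simulated hour in `hourly`, so `results.json` can become large for a full-year simulation.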

## 2. Label Data with OpenAI API

In [13]:
def label_data(results_json, inspection_report, home_dir_name):
  """
  Uses OpenAI API to label a datapoint based on its results.json and inspection report.
  """

  def build_prompt(results_json: dict, inspection_report: str) -> str:
    return f"""
You are an expert building energy analyst.

Below is structured simulation data for a building, followed by a narrative inspection report.

Your task is to assign a **confidence score** in the range [0, 1] for the **need** for each of the following retrofits:
- Insulation upgrade
- Window upgrade
- HVAC upgrade
- Sealing

A value of 0 means "definitely not needed". A value of 1 means "definitely needed". Intermediate values (e.g. 0.33, 0.5, 0.75) indicate uncertainty or partial need. Use your judgment to assign realistic values based solely on the data and report.

### SIMULATION DATA (JSON):
{json.dumps(results_json, indent=2)}

### INSPECTION REPORT (free text):
\"\"\"
{inspection_report}
\"\"\"

### RESPONSE FORMAT:
Return a JSON object like:
{{
  "insulation": 0.5,
  "windows": 0.0,
  "hvac": 1.0,
  "sealing": 0.25
}}

Only include the JSON. No explanation or commentary.
"""
  
  prompt = build_prompt(results_json, inspection_report)
    
  response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2
  )
  
  content = response.choices[0].message.content
  
  # Parse before opening the file so a bad response does not leave an empty label.json
  try:
    label = json.loads(content)
  except json.JSONDecodeError:
    raise ValueError(f"Failed to parse model response:\n{content}")

  label_path = f"dataset/{home_dir_name}/label.json"
  with open(label_path, "w") as f:
    json.dump(label, f, indent=2)
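
The prompt asks for four scores in [0, 1], but nothing in `label_data` enforces that. A minimal validation sketch is shown below, assuming the four keys from the response format in the prompt. The name `validate_label` is hypothetical and not part of the pipeline above.

```python
import json

# Keys requested in the prompt's response format
EXPECTED_KEYS = {"insulation", "windows", "hvac", "sealing"}


def validate_label(content: str) -> dict:
    """Hypothetical check: parse the model response and verify keys and ranges."""
    label = json.loads(content)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(label, dict):
        raise ValueError("Expected a JSON object")
    if set(label) != EXPECTED_KEYS:
        raise ValueError(f"Unexpected keys: {sorted(label)}")
    for key, value in label.items():
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{key} is not a number: {value!r}")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{key} is outside [0, 1]: {value}")
    return {key: float(label[key]) for key in sorted(EXPECTED_KEYS)}


checked = validate_label('{"insulation": 0.5, "windows": 0.0, "hvac": 1.0, "sealing": 0.25}')
print(checked)
```

Because every failure surfaces as a `ValueError`, a caller could catch that single exception type and retry or skip the home.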


In [14]:
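def process_results(home_dir_name):
  # Write the compiled simulation results alongside the other files for this home
  results_json = extract_results_from_csv(home_dir_name)
  with open(f'dataset/{home_dir_name}/results.json', 'w') as f:
    json.dump(results_json, f)

  # The inspection note is stored on the first feature of the cleaned GeoJSON
  with open(f'dataset/{home_dir_name}/cleaned.geojson', 'r') as f:
    inspection_note = json.load(f)["features"][0]["properties"]["inspection_note"]

  label_data(results_json, inspection_note, home_dir_name)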

## 3. Test

Now, we'll test this pipeline with one example (`RAMBEAU_RD_15`).

In [15]:
process_results('RAMBEAU_RD_15')
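
Once the single example works, the same call could be applied to every home. The sketch below only lists candidate home directories. It assumes each home is a subdirectory of `dataset/` containing `simulation_output/eplusout.csv`, which is the layout `extract_results_from_csv` reads from; the helper name `list_home_dirs` is hypothetical.

```python
from pathlib import Path


def list_home_dirs(dataset_root: str = "dataset") -> list[str]:
    """Hypothetical helper: names of homes that have an EnergyPlus CSV output."""
    root = Path(dataset_root)
    if not root.is_dir():
        return []
    return sorted(
        path.name
        for path in root.iterdir()
        if (path / "simulation_output" / "eplusout.csv").is_file()
    )


homes = list_home_dirs()
print(homes)

# Each name could then be passed to process_results, for example:
# for home in homes:
#     process_results(home)
```

Note that each `process_results` call sends one request to the OpenAI API, so labeling a large dataset takes time and incurs cost.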