
# CSC 786 – Data Ethics & Reproducibility Workshop

In [15]:
%env GITHUB_TOKEN=<PLACE TOKEN HERE>

!git config --global user.name "Ashton R." ## Display name not necessarily your username
!git config --global user.email "ashton.ruesch@gmail.com"

env: GITHUB_TOKEN=<PLACE TOKEN HERE>


In [3]:
# 1. Clone your existing repo from GitHub
!git clone https://github.com/FalloutL0rd/csc786-ethics-demo # todo update url
%cd csc786-ethics-demo


# 2. Optional: verify remote
!git remote -v


# 3. If you make changes and want to push again
!git remote set-url origin https://github.com/FalloutL0rd/csc786-ethics-demo # todo update url

!git add .
!git commit -m "Update from Colab session"
!git push


Cloning into 'csc786-ethics-demo'...
remote: Enumerating objects: 32, done.
remote: Counting objects: 100% (32/32), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 32 (delta 6), reused 31 (delta 5), pack-reused 0 (from 0)
Receiving objects: 100% (32/32), 10.39 KiB | 3.46 MiB/s, done.
Resolving deltas: 100% (6/6), done.
/content/csc786-ethics-demo
origin	https://github.com/FalloutL0rd/csc786-ethics-demo (fetch)
origin	https://github.com/FalloutL0rd/csc786-ethics-demo (push)
On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean
fatal: could not read Username for 'https://github.com': No such device or address


In [4]:
!git config --global --list

user.name=Ashton R.
user.email=ashton.ruesch@gmail.com


In [5]:
!pip -q install pandas requests python-dotenv

import os, sys, json, time, hashlib
import pandas as pd
import requests
from datetime import datetime, timezone
from pathlib import Path

#Folder base
ROOT = Path("/content/csc786-ethics-demo") if "google.colab" in str(get_ipython()) else Path(".")
DATA = ROOT / "data" / "processed"
ROOT.mkdir(parents=True, exist_ok=True)
DATA.mkdir(parents=True, exist_ok=True)

print("Environment ready. Files will be stored in:", DATA)


Environment ready. Files will be stored in: /content/csc786-ethics-demo/data/processed
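
The setup cell detects Colab by inspecting `str(get_ipython())`, which raises a `NameError` when the code runs outside IPython. A minimal alternative sketch, assuming the `google.colab` package is already imported in a Colab kernel (the usual case); `resolve_root` and its `modules` parameter are hypothetical names introduced for illustration:

```python
import sys
from pathlib import Path

def resolve_root(repo_name="csc786-ethics-demo", modules=None):
    """Return the cloned repo under /content on Colab, else the current directory."""
    # `modules` is injectable only so the check can be exercised without Colab.
    modules = sys.modules if modules is None else modules
    if "google.colab" in modules:
        return Path("/content") / repo_name
    return Path(".")

ROOT = resolve_root()
DATA = ROOT / "data" / "processed"
print("Project root:", ROOT)
```

Because the check only looks at loaded modules, the same cell can run as a plain script or in a local Jupyter session without modification.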


In [6]:
# ROOT and Path come from the setup cell above (Colab-aware), so they are not redefined here.

# 1 - README.md  (general project overview)
readme_text = """# Data Collection - EIA Residential Electricity Price

**Real project:** IEC 61850 GOOSE messaging + ICS security.

- I looked for any public APIs exposing GOOSE/ICS/substation communication data. None exist publicly (due to critical-infrastructure protections).
- No other related APIs I could find would provide a direct benefit to my project or related paper.
- As a result, I switched to something more straightforward that still let me practice the full process.

This project pulls the monthly residential electricity price for South Dakota from the EIA (Energy Information Administration).
- Pulls a small time window (last N years).
- Normalizes the response and saves one clean CSV.
- Logs a short provenance line (endpoint, params, timestamp, output path, SHA-256 hash, library versions).

## What this includes
- A notebook that:
  - Calls `https://api.eia.gov/v2/electricity/retail-sales/data/`
  - Filters `state = SD`, `sector = RES`, `frequency = monthly`
  - Converts the EIA `price` field from cents/kWh to USD/kWh
  - Saves to `data/processed/eia_price_residential_SD_<start>_to_<end>.csv`
  - Appends one JSON line to `DATA_README.md` (key is masked)

## How to run
1. Get a free key: https://www.eia.gov/opendata/index.php
2. In the notebook:
   - Paste your key or set `EIA_API_KEY` as an env var
   - Pick how many years back (`YEARS_BACK`)
   - Run all cells

Example config:
```python
STATE = "SD"
SECTOR = "RES"
YEARS_BACK = 6
```
"""
(ROOT / "README.md").write_text(readme_text, encoding="utf-8")


# 2 - ETHICS.md  → ethical statement / responsible data use
ethics_text = """# Ethical Statement

## Data source
- U.S. Energy Information Administration (EIA) Open Data v2 API (key-based, public statistics)
- Endpoint: `https://api.eia.gov/v2/electricity/retail-sales/data/`
- Metric: `price` (reported in cents/kWh, converted here to USD/kWh)

## Principles
- Public statistical data with no PII.
- Usage follows provider terms and reasonable request rates.
- Transformations are minimal and transparent (column selection + unit conversion).
- API key handled locally and masked in logs.

## Potential risks / limitations
- **Revisions**: official series can be updated over time.
- **Representativeness**: state-level averages are not any single customer’s bill.
- **Units**: the source reports in cents/kWh, so unit clarity matters.

## Mitigations
- Keep an append-only provenance log (`DATA_README.md`) with endpoint, parameters, timestamp, output path, and SHA-256 hash of the output file.
- Explicit unit conversion documented in code and README.
- Pin the time window (`start`, `end`) in logs so results are reproducible.
"""
(ROOT / "ETHICS.md").write_text(ethics_text, encoding="utf-8")  # text contains a non-ASCII apostrophe


# 3 - DATA_README.md  → provenance log (append-only)
data_readme = ROOT / "DATA_README.md"
if not data_readme.exists():
    data_readme.write_text(
        "# Data Provenance Log (append-only)\nEach entry documents one data-collection event.\n---\n",
        encoding="utf-8"
    )

print("Created reproducibility files:")
!ls -lh "{ROOT}"/*.md   # the files live in the repo folder, not directly in /content

Created reproducibility files:


In [14]:
%env EIA_API_KEY=<PLACE KEY HERE>

API_KEY = os.getenv("EIA_API_KEY")
print("Key loaded:", API_KEY[:6] + "****" if API_KEY else "No key found")

env: EIA_API_KEY=<PLACE KEY HERE>
Key loaded: <PLACE****
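
The output above shows that the unreplaced `<PLACE KEY HERE>` template is reported as a loaded key, and that the first six characters of whatever is set get printed. A small sketch of a stricter check follows. `load_api_key` and `mask_secret` are hypothetical helpers, and treating any value that starts with `<` as a placeholder is an assumption about this notebook's template, not a rule of the EIA API:

```python
import os

def load_api_key(env_var="EIA_API_KEY"):
    """Return the key from the environment, or None if it is missing or still a placeholder."""
    key = os.getenv(env_var, "").strip()
    # Assumption: template placeholders look like "<PLACE KEY HERE>".
    if not key or key.startswith("<"):
        return None
    return key

def mask_secret(secret, visible=4):
    """Hide everything except the last `visible` characters."""
    if not secret:
        return "No key found"
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]

os.environ["EIA_API_KEY"] = "<PLACE KEY HERE>"
print("Key loaded:", mask_secret(load_api_key()))   # placeholder is rejected

os.environ["EIA_API_KEY"] = "abcdef1234567890"       # made-up value, not a real key
print("Key loaded:", mask_secret(load_api_key()))
```

In an interactive session, `getpass.getpass()` from the standard library is another way to supply the key without leaving it in a saved cell.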


In [8]:
STATE      = "SD"     #South Dakota
SECTOR     = "RES"    #Residential
YEARS_BACK = 6        #How far back

now = datetime.now(timezone.utc)
start_year = now.year - YEARS_BACK
START = f"{start_year}-01"
END   = f"{now.year}-{now.month:02d}"
EIA_ENDPOINT = "https://api.eia.gov/v2/electricity/retail-sales/data/"

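def fetch_eia_price_monthly(start_ym, end_ym, api_key, state="SD", sector="RES"):
    """Fetch monthly retail electricity prices from EIA and return (DataFrame, params).

    When rows are returned, the DataFrame has the columns
    period, state, sector, price_usd_per_kwh.
    The returned params still contain the raw API key; mask it before logging.
    """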
    params = {
        "api_key": api_key,
        "frequency": "monthly",
        "data[0]": "price",
        "facets[stateid][]": state,
        "facets[sectorid][]": sector,
        "start": start_ym,
        "end": end_ym,
        "sort[0][column]": "period",
        "sort[0][direction]": "asc",
        "offset": 0,
        "length": 5000
    }
    r = requests.get(EIA_ENDPOINT, params=params, timeout=30)
    r.raise_for_status()
    rows = r.json().get("response", {}).get("data", [])
    df = pd.DataFrame(rows)
    if df.empty:
        return df, params

    if "state" not in df.columns and "stateid" in df.columns:
        df["state"] = df["stateid"]
    if "sector" not in df.columns and "sectorid" in df.columns:
        df["sector"] = df["sectorid"]

    df["price_usd_per_kwh"] = pd.to_numeric(df["price"], errors="coerce") / 100.0
    return df[["period","state","sector","price_usd_per_kwh"]], params

df, USED_PARAMS = fetch_eia_price_monthly(START, END, API_KEY, state=STATE, sector=SECTOR)
df.head()

    period  state  sector  price_usd_per_kwh
0  2019-01     SD     RES             0.1055
1  2019-02     SD     RES             0.1045
2  2019-03     SD     RES             0.1062
3  2019-04     SD     RES             0.1151
4  2019-05     SD     RES             0.1214
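
The function above requests a single page (`offset` 0, `length` 5000), which is plenty for a few years of monthly data for one state and sector. If the query were widened (all states, all sectors), more than one page could be needed. The sketch below shows one generic way to walk pages by advancing the offset until a short page comes back. `fetch_all_rows` and `fake_fetch_page` are hypothetical names, and the in-memory rows are made up so the example runs without a network call or API key:

```python
def fetch_all_rows(fetch_page, page_size=5000):
    """Collect rows page by page; a page shorter than page_size marks the end."""
    rows, offset = [], 0
    while True:
        page = fetch_page(offset, page_size)
        rows.extend(page)
        if len(page) < page_size:
            return rows
        offset += page_size

# Stand-in for the real request: made-up rows, not EIA data.
FAKE_ROWS = [{"period": f"2019-{m:02d}", "price": "10.5"} for m in range(1, 13)]

def fake_fetch_page(offset, length):
    """Return one slice of the fake dataset, like a paged API would."""
    return FAKE_ROWS[offset:offset + length]

all_rows = fetch_all_rows(fake_fetch_page, page_size=5)
print(len(all_rows), "rows collected")
```

In real use, `fetch_page` would wrap the `requests.get` call with `"offset"` and `"length"` set from its arguments. When the total is an exact multiple of the page size, this loop makes one extra request that returns an empty page.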


In [9]:
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
out_csv = DATA / f"eia_price_residential_{STATE}_{START}_to_{END}.csv"
df.to_csv(out_csv, index=False)

file_hash = hashlib.sha256(out_csv.read_bytes()).hexdigest()

meta = {
    "timestamp_utc": timestamp,
    "endpoint": EIA_ENDPOINT,
    "params": {**USED_PARAMS, "api_key": "***"},   # mask key
    "output": str(out_csv).replace("\\","/"),
    "sha256": file_hash,
    "python": sys.version.split()[0],
    "pandas": pd.__version__,
    "requests": requests.__version__,
}
with open(ROOT / "DATA_README.md", "a", encoding="utf-8") as f:
    f.write("\n- " + json.dumps(meta))

print(f"Saved {out_csv.name}, hash={file_hash[:10]}…  →  {out_csv}")
!tail -n 3 "{ROOT}/DATA_README.md"


Saved eia_price_residential_SD_2019-01_to_2025-10.csv, hash=1502c437d8…  →  /content/csc786-ethics-demo/data/processed/eia_price_residential_SD_2019-01_to_2025-10.csv

- {"timestamp_utc": "2025-10-22T00:07:27Z", "endpoint": "https://api.eia.gov/v2/electricity/retail-sales/data/", "params": {"api_key": "***", "frequency": "monthly", "data[0]": "price", "facets[stateid][]": "SD", "facets[sectorid][]": "RES", "start": "2019-01", "end": "2025-10", "sort[0][column]": "period", "sort[0][direction]": "asc", "offset": 0, "length": 5000}, "output": "/content/csc786-ethics-demo/data/processed/eia_price_residential_SD_2019-01_to_2025-10.csv", "sha256": "1502c437d81bdcaa194c4d77590c4c8b8fd7fb19649f85d5ef25ca60e8f0ab92", "python": "3.12.12", "pandas": "2.2.2", "requests": "2.32.4"}
- {"timestamp_utc": "2025-10-22T00:35:14Z", "endpoint": "https://api.eia.gov/v2/electricity/retail-sales/data/", "params": {"api_key": "***", "frequency": "monthly", "data[0]": "price", "facets[stateid][]": "SD", "facets
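
The logged `sha256` is only useful if someone checks it later. The sketch below shows one way to do that, assuming each log entry is a line of the form `- {json}` as written by the cell above. `latest_log_entry` and `verify_output` are hypothetical helpers, and the demo uses a temporary folder with made-up file contents rather than the real CSV:

```python
import hashlib
import json
import tempfile
from pathlib import Path

def latest_log_entry(log_path):
    """Return the last parseable '- {json}' entry in the provenance log, or None."""
    entry = None
    for line in Path(log_path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line.startswith("- {"):
            try:
                entry = json.loads(line[2:])
            except json.JSONDecodeError:
                continue  # skip truncated or malformed lines
    return entry

def verify_output(entry):
    """Recompute the file hash and compare it with the logged sha256."""
    data = Path(entry["output"]).read_bytes()
    return hashlib.sha256(data).hexdigest() == entry["sha256"]

with tempfile.TemporaryDirectory() as tmp:
    csv_file = Path(tmp) / "demo.csv"
    csv_file.write_text("period,price_usd_per_kwh\n2019-01,0.1\n", encoding="utf-8")
    meta = {"output": str(csv_file),
            "sha256": hashlib.sha256(csv_file.read_bytes()).hexdigest()}
    log_file = Path(tmp) / "DATA_README.md"
    log_file.write_text("# Data Provenance Log (append-only)\n---\n\n- " + json.dumps(meta),
                        encoding="utf-8")

    entry = latest_log_entry(log_file)
    ok_before = verify_output(entry)                       # file untouched
    csv_file.write_text("tampered\n", encoding="utf-8")
    ok_after = verify_output(entry)                        # file changed after logging

print("matches before edit:", ok_before, "| matches after edit:", ok_after)
```

A mismatch does not say what changed, only that the file on disk is no longer the one that was logged.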

In [10]:
!ls -lh /content
!ls -lh /content/data
!head -n 5 README.md
!tail -n 5 DATA_README.md

total 4.0K
drwxr-xr-x 5 root root 4.0K Oct 22 00:33 csc786-ethics-demo
ls: cannot access '/content/data': No such file or directory
# Data Collection - EIA Residential Electricity Price

**Real project:** IEC 61850 GOOSE messaging + ICS security.

- I looked for any public APIs exposing GOOSE/ICS/substation communication data. None exist publicly (due to critical-infrastructure protections).

---

- {"timestamp_utc": "2025-10-22T00:07:27Z", "endpoint": "https://api.eia.gov/v2/electricity/retail-sales/data/", "params": {"api_key": "***", "frequency": "monthly", "data[0]": "price", "facets[stateid][]": "SD", "facets[sectorid][]": "RES", "start": "2019-01", "end": "2025-10", "sort[0][column]": "period", "sort[0][direction]": "asc", "offset": 0, "length": 5000}, "output": "/content/csc786-ethics-demo/data/processed/eia_price_residential_SD_2019-01_to_2025-10.csv", "sha256": "1502c437d81bdcaa194c4d77590c4c8b8fd7fb19649f85d5ef25ca60e8f0ab92", "python": "3.12.12", "pandas": "2.2.2", "requests

In [11]:
!git remote set-url origin https://FalloutL0rd:$GITHUB_TOKEN@github.com/FalloutL0rd/csc786-ethics-demo.git

!git add .
!git commit -m "Update from Colab session"
!git push

[main 1ccd824] Update from Colab session
 1 file changed, 2 insertions(+), 1 deletion(-)
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 2 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 311 bytes | 311.00 KiB/s, done.
Total 3 (delta 2), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/FalloutL0rd/csc786-ethics-demo.git
   7949047..1ccd824  main -> main
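
The cell above stores the token inside the remote URL, which `git remote set-url` writes to `.git/config`. The sketch below is one hedged alternative: build the authenticated URL in Python and keep a redacted copy for printing. `with_credentials` and `redact` are hypothetical helpers, `"example-token"` is a placeholder rather than a real credential, and the input URL is assumed to carry no credentials of its own:

```python
from urllib.parse import quote, urlsplit, urlunsplit

REPO_URL = "https://github.com/FalloutL0rd/csc786-ethics-demo.git"

def with_credentials(repo_url, user, token):
    """Return repo_url with user:token embedded (both percent-encoded)."""
    parts = urlsplit(repo_url)
    netloc = f"{quote(user, safe='')}:{quote(token, safe='')}@{parts.netloc}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

def redact(url):
    """Replace any password in the URL with *** so it is safe to print or log."""
    parts = urlsplit(url)
    if parts.password is None:
        return url
    host = parts.netloc.rsplit("@", 1)[1]
    netloc = f"{parts.username}:***@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

push_url = with_credentials(REPO_URL, "FalloutL0rd", "example-token")
print(redact(push_url))
```

In the notebook, a one-off `!git push "{push_url}" HEAD:main` would then push without changing the configured remote, so the token is not persisted in the repository's config.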



# Reflection

I first tried to find a public API with IEC 61850/GOOSE or broader ICS data, but nothing legitimate is openly available, either because it is withheld for security reasons or because it does not exist. Instead, I used the EIA API to pull the monthly residential electricity price for South Dakota. I set my key locally, picked a simple parameter set (state, sector, frequency, date range), requested JSON, normalized it to a tidy DataFrame, converted price from cents/kWh to USD/kWh, saved one CSV, and logged a short provenance entry (endpoint, masked key, params, timestamp, output path, SHA-256 hash, library versions). This gave me practice handling an API key safely, building a small but reproducible pipeline, and documenting exactly what I pulled. The dataset is useful for month-to-month and year-over-year comparisons, and it is easy to extend to other sectors, other states, or longer time frames.
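
As a sketch of the month-to-month and year-over-year comparisons mentioned above, the snippet below computes both with `pandas.Series.pct_change`. The prices are made-up placeholders (flat within each year), not EIA values; only the column names `period` and `price_usd_per_kwh` match the saved CSV:

```python
import pandas as pd

# Hypothetical prices in USD/kWh: 0.10 for every month of 2023, 0.11 for 2024.
periods = [f"{year}-{month:02d}" for year in (2023, 2024) for month in range(1, 13)]
prices = [0.10] * 12 + [0.11] * 12
demo = pd.DataFrame({"period": periods, "price_usd_per_kwh": prices})

# YYYY-MM strings sort chronologically, so a plain sort is enough here.
demo = demo.sort_values("period").reset_index(drop=True)
demo["mom_pct"] = demo["price_usd_per_kwh"].pct_change(periods=1) * 100   # vs. previous month
demo["yoy_pct"] = demo["price_usd_per_kwh"].pct_change(periods=12) * 100  # vs. same month last year

print(demo.tail(3))
```

The first twelve `yoy_pct` values are `NaN` because there is no prior-year month to compare against, and the lag of 12 rows assumes no missing months in the series.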