# Project Data Acquisition

This code downloads the [bchydro-outages](https://github.com/outages/bchydro-outages/tree/main) project (including all commit history) from GitHub and saves it to a subdirectory.

It then processes the commit history into a pandas-compatible format and exports it as a CSV.

See BC Hydro's frontend here: https://www.bchydro.com/power-outages/app/outage-map.html

## Configuration

In [None]:
# Be wary of how much compute big numbers require
# 14 days took ~30 seconds to run on my computer, plus ~15 seconds to download the repo - Bea
# 90 days took ~10 minutes to run on my less good laptop - Bea
# The file size of the CSV will probably never be more than 200 MB
# 150 days took ~2 minutes and 52 sec to run on my mac m1 air, total size 18.9 MB - Soumya
DAYS_TO_CAPTURE = 9
"""The number of days of history to capture in the returned pandas DataFrame"""

DELETE_OLD_REPO = False
"""If True, deletes the old repository data and starts fresh"""

## Fetch the Repository

In [None]:
import subprocess
import os
import shutil

target = os.getcwd()
repoName = "bchydro-outages"

repoPath = os.path.join(target, repoName)

if DELETE_OLD_REPO and os.path.exists(repoPath):
  # https://stackoverflow.com/a/6996628 <-- How to delete a directory in Python
  shutil.rmtree(repoPath)

# https://stackoverflow.com/a/4760517 <-- How to run subprocess in Python

# Clone the REPO and save to <repoName> folder under DataAcquisition/
result = subprocess.run(
  ["git", "clone", "https://github.com/outages/bchydro-outages.git", f"./{repoName}"],
  cwd=target,
  capture_output=True,
)
print(result.stdout.decode("utf-8"))
print(result.stderr.decode("utf-8"))

# Confirm that the repository was cloned
assert os.path.exists(repoPath)

## Process Commit History

The repo contains only one file: a `.json` file listing the outages BC Hydro is tracking right now. To find historical data, we need to traverse the commit history and merge the snapshots from each commit together.
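As a rough sketch of that traversal, the helpers below parse `git log --pretty=format:%H:%ct` output into (hash, timestamp) pairs and read a file as it existed at a given commit using `git show <hash>:<path>`, which avoids checking out each commit. The function names `parse_commit_log` and `read_json_at_commit` are hypothetical and are not part of this notebook; the sample hashes are made up.

```python
import json
import subprocess


def parse_commit_log(log_text):
  """Turn `git log --pretty=format:%H:%ct` output into (hash, unix_seconds) pairs."""
  pairs = []
  for line in log_text.splitlines():
    line = line.strip()
    if not line:
      continue  # An empty log yields no commits instead of a blank entry
    commit_hash, timestamp = line.split(":")
    pairs.append((commit_hash, int(timestamp)))
  return pairs


def read_json_at_commit(repo_path, commit_hash, file_name):
  """Read a JSON file as it existed at a commit, leaving the working tree untouched."""
  result = subprocess.run(
    ["git", "show", f"{commit_hash}:{file_name}"],
    cwd=repo_path,
    capture_output=True,
    check=True,  # Raise if git reports an error (e.g. the file is missing)
  )
  return json.loads(result.stdout.decode("utf-8"))


# Made-up log text in the same "<hash>:<unix seconds>" shape
sample_log = "abc123:1700000000\ndef456:1700000900"
print(parse_commit_log(sample_log))
```

Reading with `git show` keeps the working tree on `main` throughout, at the cost of one subprocess call per commit, the same as the checkout approach.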


### Step 1) Get the JSON data from each commit

In [None]:
import json

# Reset to main branch
subprocess.run(["git", "checkout", "main"], cwd=repoPath)

# List every commit from the last N days, oldest first
afterTime = f"{DAYS_TO_CAPTURE} days ago"

commitLog = subprocess.run(
  ["git", "log", f"--since={afterTime}", "--pretty=format:%H:%ct", "--reverse"],
  cwd=repoPath,
  capture_output=True,
)
commits = commitLog.stdout.decode("utf-8")

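# Each line is "<hash>:<unix timestamp>"; skip blank lines so an empty log gives an empty list
commits = [commit.split(":") for commit in commits.splitlines() if commit]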

# Get JSON file for each commit

OUTAGES_FILE_NAME = "bchydro-outages.json"
outagesFilePath = os.path.join(repoPath, OUTAGES_FILE_NAME)


def getJSON(commit):
  subprocess.run(["git", "checkout", commit[0]], cwd=repoPath, capture_output=True)
  # outagesFilePath is already an absolute path, so no further join is needed
  assert os.path.exists(outagesFilePath)
  with open(outagesFilePath, "r") as f:
    # https://stackoverflow.com/q/20199126 <-- How to load JSON from a file
    return json.load(f)


jsonData = [getJSON(commit) for commit in commits]

### Step 2) Combine the JSON data

Because the file only tracks current outages, we never get to see the actual "timeOn" once an outage ends. The `dateOn` field in the JSON is only an estimated time.

However, we can infer when an outage ended (within ±15 minutes) by checking whether it is still present in the next commit. This is why we start from the oldest commit and work towards the present day.

This does mean that the most recent outages (any that are still ongoing) won't be added to the list, but that's fine.
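The idea can be illustrated on toy data. The sketch below is a dictionary-keyed variant of the same approach, not the notebook's exact code: `infer_end_times` is a hypothetical helper, and the snapshots are made-up `(commit_time, active_outages)` pairs ordered oldest first.

```python
def infer_end_times(snapshots):
  """Return (finished, still_active) given snapshots ordered oldest first."""
  finished = []
  waiting = {}  # outage id -> most recent record seen for that outage

  for commit_time, active in snapshots:
    active_by_id = {outage["id"]: outage for outage in active}

    # An outage we were tracking that is no longer listed has ended
    for outage_id in list(waiting):
      if outage_id not in active_by_id:
        record = waiting.pop(outage_id)
        record["endTime"] = commit_time
        finished.append(record)

    # Refresh outages that are still active and start tracking new ones
    for outage_id, outage in active_by_id.items():
      waiting[outage_id] = dict(outage)

  return finished, list(waiting.values())


# Made-up snapshots: outage 1 disappears at t=200, outage 2 at t=300
toy_snapshots = [
  (100, [{"id": 1}, {"id": 2}]),
  (200, [{"id": 2}, {"id": 3}]),
  (300, [{"id": 3}]),
]
finished, still_active = infer_end_times(toy_snapshots)
print(finished)
print(still_active)
```

Outage 3 is still listed in the final snapshot, so it stays in `still_active` without an end time, matching the note above about ongoing outages.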

In [None]:
# Make sure to take all but start time from the latest entry for an outage
outages = []


waitingForEndTime = []
for i, data in enumerate(jsonData):
  commitTime = commits[i][1]

  # data is a list of active outages
  activeOutageIds = [outage["id"] for outage in data]

  # Check if any of the outages in waitingForEndTime are in the active outages
  # Push to final array if they aren't

  newWaiting = []
  for waitingOutage in waitingForEndTime:
    if waitingOutage["id"] not in activeOutageIds:
      waitingOutage["endTime"] = commitTime
      outages.append(waitingOutage)
    else:
      newWaiting.append(waitingOutage)
  waitingForEndTime = newWaiting

  # Override any outages waiting for an end time with the latest data
  # (use `j` so the outer loop index `i` is not shadowed)
  for j, waitingOutage in enumerate(waitingForEndTime):
    for outage in data:
      if waitingOutage["id"] == outage["id"]:
        waitingForEndTime[j] = outage
        break

  # Add any new outages to waitingForEndTime
  for outage in data:
    if outage["id"] not in [waitingOutage["id"] for waitingOutage in waitingForEndTime]:
      waitingForEndTime.append(outage)

print(f"Raw Outages Indexed: {len(outages)}")
print(f"Active Outages (Not recorded): {len(waitingForEndTime)}")

### Step 3) Cleanup 

Delete unneeded fields and convert all times to proper datetime objects.

Some terminology:
- "eta" is the estimated time of arrival for the repair crew
- "etr" is the estimated time of restoration for the power
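The timestamp conversion depends on whether a field stores Unix seconds or Unix milliseconds. As a minimal standard-library sketch of what that unit distinction means (the notebook itself uses `pd.to_datetime` with the pairs in `timeStampFields` from `common`, whose contents are not shown here), the hypothetical helper `to_utc_datetime` below converts either unit to a timezone-aware UTC datetime.

```python
from datetime import datetime, timezone


def to_utc_datetime(value, unit):
  """Convert a Unix timestamp in seconds ("s") or milliseconds ("ms") to UTC."""
  if unit == "ms":
    seconds = value / 1000
  elif unit == "s":
    seconds = value
  else:
    raise ValueError(f"Unsupported unit: {unit}")
  return datetime.fromtimestamp(seconds, tz=timezone.utc)


# The same instant expressed in both units (values chosen for illustration)
from_seconds = to_utc_datetime(1700000000, "s")
from_millis = to_utc_datetime(1700000000000, "ms")
print(from_seconds.isoformat())
print(from_seconds == from_millis)
```

Passing the wrong unit shifts the result by a factor of 1000, which is why each field is paired with its unit rather than converted uniformly.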

In [None]:
import pandas as pd
from common import timeStampFields

pdOutages = pd.DataFrame(outages)

# `dateOn` is BC Hydro's estimate of when power would be restored.
# `endTime` is the measured time at which the outage was removed from the active outages page.
# So let's rename `dateOn` to `estDateOn` and `endTime` to `dateOn`
pdOutages.rename(
  {"dateOn": "estDateOn", "endTime": "dateOn"}, axis="columns", inplace=True
)

# Let's delete the columns that aren't useful
pdOutages.drop(
  columns=[
    "showEta",  # Only useful for BC Hydro's website
    "showEtr",  # Only useful for BC Hydro's website
    "crewStatusNote",  # Not specified format (and usually empty)
    "crewStatusDescription",  # Doesn't update once outage is resolved
    "crewStatus",  # Doesn't update once outage is resolved
  ],
  inplace=True
)

# Convert to datetime (some timestamps are in ms unix and some are in s unix so this accounts for that)
for field, unit in timeStampFields:
  pdOutages[field] = pd.to_datetime(pdOutages[field], unit=unit, utc=True)

# Sort by when the power went off, then by when it came back on
pdOutages = pdOutages.sort_values(by=["dateOff", "dateOn"])

## Export

In [None]:
from common import csvFileName

pdOutages.to_csv(csvFileName, index=False)