## Script to scrape load data from the website of the TSOC

This script uses Selenium and geckodriver to automatically download the electricity consumption data published on the TSOC website. The site normally serves the data in 15-day tranches, so the script makes it easier to collect multi-month or multi-year data.

It works by generating a list of start dates spaced 15 days apart that covers the date range we want to download. For each start date, it builds the URL and uses Firefox in headless mode to open the page and fetch the data.

Of course, if the TSOC changes its website (for example the page layout or the URL parameters), the script will fail.
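
The date-stepping and URL-building steps described above can be sketched on their own, without a browser. This is a minimal illustration only: the helper names `window_starts` and `build_url` are hypothetical and do not appear in the cells below, while the URL and the `startdt`/`enddt` parameters are the ones the download loop uses.

```python
from datetime import date, timedelta

# Archive page used by the download loop in this notebook
BASE_URL = ("https://tsoc.org.cy/archive-total-daily-system-generation-"
            "on-the-transmission-system/")


def window_starts(first, last, step_days=15):
    """Return the start dates from first to last (inclusive), step_days apart."""
    starts = []
    current = first
    while current <= last:
        starts.append(current)
        current += timedelta(days=step_days)
    return starts


def build_url(start):
    """Build the archive URL for the 15-day window that begins at start."""
    return f"{BASE_URL}?startdt={start.strftime('%d-%m-%Y')}&enddt=%2B15days"


# Example: the windows covering 1 January to 15 February 2019
starts = window_starts(date(2019, 1, 1), date(2019, 2, 15))
for s in starts:
    print(build_url(s))
```

Keeping the date arithmetic in plain functions like these makes it possible to check the request list before any browser is started.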

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from datetime import date, timedelta
import time
import os

In [None]:
# Get the list of data files I already have. This way we don't need to re-download everything
# and we reduce the load on the website. Be kind.
mypath = './power_data/'
# os.listdir returns names in arbitrary order, so sort them. This assumes the file names share
# a common prefix followed by the start date as YYYY-MM-DD, so that sorting is chronological.
datafiles = sorted(f for f in os.listdir(mypath) if os.path.isfile(os.path.join(mypath, f)))

# Usually, the last file downloaded includes days with NaN values. If, for example, we ask for 15 days
# but only 10 days are available, the rest is filled in with NaN. So, we remove the last
# file to get an updated one.
if datafiles:
    lastfile = datafiles.pop()
    os.remove(os.path.join(mypath, lastfile))

# Extract the start dates from the file names
datafilesclean = [f[18:28] for f in datafiles]
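
The fixed slice `f[18:28]` depends on the exact length of the file-name prefix. As a hedged alternative, the start date could be located with a regular expression instead. This is only a sketch: the helper `start_date_from_name` and the file names below are hypothetical, and it assumes the start date is the first `YYYY-MM-DD` string in the name.

```python
import re

# Matches an ISO-style date such as 2019-01-01
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def start_date_from_name(filename):
    """Return the first YYYY-MM-DD substring in filename, or None if absent."""
    match = ISO_DATE.search(filename)
    return match.group(0) if match else None


# Hypothetical file names, for illustration only
names = ["load_2019-01-01_2019-01-15.xlsx", "notes.txt"]
found = [start_date_from_name(n) for n in names]
print(found)
```

Names without a date yield `None`, so unrelated files in the folder can be filtered out rather than producing a meaningless slice.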

In [None]:
# Generate the dates I'll be requesting. It's from the start date to today with jumps of 15 days
start_date = date(2019, 1, 1)
end_date = date.today()
delta = timedelta(days=15)
datelist = []
while start_date <= end_date:
    if start_date.strftime("%Y-%m-%d") not in datafilesclean:
        datelist.append(start_date.strftime("%d-%m-%Y"))
    start_date += delta
print(datelist)

In [None]:
# This is the XPATH of the button to click to get the excel file
# I got this by "inspecting" the website
myxpath = "/html/body/div[2]/div/div/div/div/article/div/div[11]/div[1]/button"

# This is the path to geckodriver.exe and the download directory 
# You need to have firefox installed as well for this to work
geckopath = r'C:\Users\p3tri\geckodriver.exe'
download_dir = 'C:\\Users\\p3tri\\OneDrive - Cyprus University of Technology\\Research projects\\2020\\EAC timeseries\\data\\'

# Tell Firefox to download the Excel files automatically and where to put them
profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("browser.download.dir", download_dir)
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")

options = webdriver.FirefoxOptions()
options.add_argument('-headless')

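# Initialise the Firefox browser
# Note: the executable_path and firefox_profile arguments are deprecated in Selenium 4
# (a Service object and options.profile replace them there); this call targets the older API
driver = webdriver.Firefox(executable_path=geckopath, firefox_profile=profile, options=options)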

In [None]:
# Loop over the dates and download
for dt in datelist:
    # Generate the link to fetch the data
    url="https://tsoc.org.cy/archive-total-daily-system-generation-on-the-transmission-system/?startdt="+dt+"&enddt=%2B15days"
    # print(dt)
    driver.get(url)
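    try:
        btn = WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, myxpath)))
    except TimeoutException:
        print("Loading took too much time: "+dt)
        # Skip this date; otherwise btn would be undefined or left over from the previous date
        continue
    time.sleep(5)
    btn.click()
    # Give the download time to finish before moving on
    time.sleep(30)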
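
The fixed 30-second pause after each click could be replaced by polling the download folder until a new file shows up. The sketch below is an assumption-laden illustration, not part of the original script: the helper `wait_for_new_file` is hypothetical, and it relies on Firefox writing in-progress downloads to temporary `.part` files.

```python
import os
import time


def wait_for_new_file(directory, before, timeout=60.0, poll=0.5):
    """Poll directory until a finished file that is not in `before` appears.

    Returns the new file name, or None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        names = set(os.listdir(directory))
        # Assumption: a ".part" file means a download is still in progress
        in_progress = any(n.endswith(".part") for n in names)
        new = sorted(n for n in names - before if not n.endswith(".part"))
        if new and not in_progress:
            return new[0]
        time.sleep(poll)
    return None


# Intended use inside the loop (not executed here):
#   before = set(os.listdir(download_dir))
#   btn.click()
#   wait_for_new_file(download_dir, before)
```

Polling returns as soon as the file is complete, which shortens long runs, and a `None` result flags dates whose download never arrived.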

In [None]:
# Close everything. quit() closes all windows and ends the geckodriver session,
# so a separate close() call is not needed.
driver.quit()