In [1]:
import requests
import httpx
import asyncio

import time

import pandas as pd

## Systems metadata cleaning

After the metadata for all non-confidential systems has been extracted, it is loaded into a data frame, then cleaned and structured.

In [2]:
systems_metadata = pd.read_csv('systems_metadata', index_col='Unnamed: 0')

In [3]:
systems_metadata.head(1)

Unnamed: 0,available_years,comments,confidential,inverter_mfg,inverter_model,module_mfg,module_model,site_power,module_tech,name_private,name_public,site_area,site_azimuth,site_elevation,site_latitude,site_longitude,site_tilt,system_id
0,"[2010, 2011, 2012, 2013, 2014, 2015, 2016, 201...",Having repeated problems with inverter main bo...,False,SatCon Technology,135kW,Sharp,NU-U240F1,146640.0,1.0,Andre Agassi Preparatory Academy - Building A,[34] Andre Agassi Preparatory Academy - Buildi...,996.0278,180.0,620.0,36.1952,-115.1582,11.2,34


The systems metadata contains information about the location (`site_area`, `site_elevation`, `site_latitude`, `site_longitude`), the panels' geometric configuration (`site_tilt`, `site_azimuth`), the system's equipment (`inverter_model`, `module_model`, `site_power`, etc.) and the years for which records are available. In some records `available_years` is an empty list. Those records can't be used for the further requests, so they are dropped.

In [4]:
systems_metadata = systems_metadata.drop(systems_metadata[systems_metadata['available_years'] == '[]'].index)

In [5]:
systems_metadata.shape

(53, 18)

After dropping those records, 53 PV systems are left.

The data set is still not structured: adding a record currently means appending a new year to the list of available years, whereas in structured data every new record gets its own row. To achieve that, the list is exploded. Because the CSV stores each list as a string, it is first split into a Python list.

In [6]:
systems_metadata['available_years'] = systems_metadata['available_years'].str.strip('[]').str.split(', ')

In [7]:
systems_metadata = systems_metadata.explode('available_years')
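
The two steps above can be tried on a small toy frame. This is only a sketch: the system ids and years below are made up for illustration and are not taken from the metadata. It also shows that the exploded values are still strings, which may matter if numeric years are needed later.

```python
import pandas as pd

# Toy frame mimicking the CSV: each list of years is stored as a string.
# The values are invented for illustration.
toy = pd.DataFrame({
    'system_id': [34, 51],
    'available_years': ['[2010, 2011, 2012]', '[2015]'],
})

# Turn the string into a real list, then give every year its own row.
toy['available_years'] = toy['available_years'].str.strip('[]').str.split(', ')
toy = toy.explode('available_years')

# The exploded values are still strings; cast them if numeric years are needed.
toy['available_years'] = toy['available_years'].astype(int)

print(toy)
```

Note that `explode` repeats the original index for every new row, which is why the index `0` appears more than once in the output of the notebook.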

In [8]:
systems_metadata.head(2)

Unnamed: 0,available_years,comments,confidential,inverter_mfg,inverter_model,module_mfg,module_model,site_power,module_tech,name_private,name_public,site_area,site_azimuth,site_elevation,site_latitude,site_longitude,site_tilt,system_id
0,2010,Having repeated problems with inverter main bo...,False,SatCon Technology,135kW,Sharp,NU-U240F1,146640.0,1.0,Andre Agassi Preparatory Academy - Building A,[34] Andre Agassi Preparatory Academy - Buildi...,996.0278,180.0,620.0,36.1952,-115.1582,11.2,34
0,2011,Having repeated problems with inverter main bo...,False,SatCon Technology,135kW,Sharp,NU-U240F1,146640.0,1.0,Andre Agassi Preparatory Academy - Building A,[34] Andre Agassi Preparatory Academy - Buildi...,996.0278,180.0,620.0,36.1952,-115.1582,11.2,34


In [9]:
systems_metadata.shape

(424, 18)

The total number of records after exploding every available year of a system into its own row is 424.

The next part of the task is extracting the data for every system for its available years. To accomplish that, requests are made to the 'Annual Data CSV for a System' endpoint of the NREL API. The data for each system is exported to one file in CSV format (saved here with a .txt extension). Since 424 requests have to be made and the results saved to 53 files, the execution times of a synchronous and an asynchronous method are compared.

## Request for annual data for a system - synchronous

With synchronous programming, every task is executed only after the previous one is done. With requests, a significant amount of time is spent just waiting for the result. That is the case here: whenever `request_annual_data` is called, the program blocks until the response for that particular year arrives.

In [10]:
def request_annual_data(system, year):
    
    '''Request the annual data for a system for a particular year.'''
    
    annual_data_request_result = requests.get(f'https://developer.nrel.gov/api/pvdaq/v3/data_file?api_key=xPPvwr5Jn6RvoUod52vgbckHsa0pX382wVSJwU0o&system_id={system}&year={year}')
    return annual_data_request_result
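
As a possible variation, the request URL could be assembled from its parts so that the API key does not have to be written into the notebook. The sketch below uses only the standard library; the helper name `build_annual_data_url`, the environment variable `NREL_API_KEY` and the `'YOUR_API_KEY'` placeholder are hypothetical and not part of the original code.

```python
import os
from urllib.parse import urlencode

# Endpoint taken from the request above.
BASE_URL = 'https://developer.nrel.gov/api/pvdaq/v3/data_file'

def build_annual_data_url(system, year, api_key=None):
    '''Build the request URL for one system and one year.

    NREL_API_KEY is a hypothetical environment variable name and
    'YOUR_API_KEY' is only a placeholder, not a working key.
    '''
    if api_key is None:
        api_key = os.environ.get('NREL_API_KEY', 'YOUR_API_KEY')
    query = urlencode({'api_key': api_key, 'system_id': system, 'year': year})
    return f'{BASE_URL}?{query}'

print(build_annual_data_url(2, 2011, api_key='my-key'))
```

The resulting string can be passed to `requests.get` or `client.get` in the same way as the f-string URL.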

In [11]:
def export_request_result(system):
    
    '''Export the annual data for all available years
    of a particular system in .txt format'''
    
    # Select the years of the requested system (previously hard-coded to system 2)
    available_years = systems_metadata[systems_metadata['system_id'] == system]['available_years']

    with open(f"annual_data/annual_data_system_{system}.txt", "w") as f:
        for year in available_years:
            f.write(request_annual_data(system, year).text)

## Request for annual data for a system - asynchronous

In an asynchronous program, the process is not blocked while it waits for a result: it can proceed to the next task, as long as that task does not need the result of a previous one that is still unfinished.
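
The effect can be illustrated without any network access. In this minimal sketch `asyncio.sleep` stands in for the waiting time of a request; the function names and the 0.1 second delay are assumptions chosen for the illustration. It is written as a standalone script, so inside a notebook, where an event loop is already running, the `asyncio.run(...)` call would have to be replaced by a plain `await`.

```python
import asyncio
import time

async def fake_request(year, delay=0.1):
    '''Stand-in for a network call: waits without blocking the event loop.'''
    await asyncio.sleep(delay)
    return year

async def one_by_one(years):
    # Each wait finishes before the next one starts.
    return [await fake_request(year) for year in years]

async def all_at_once(years):
    # All waits overlap; gather returns results in the order of its arguments.
    return await asyncio.gather(*(fake_request(year) for year in years))

def timed(coroutine):
    '''Run a coroutine to completion and return its result and duration.'''
    start = time.perf_counter()
    result = asyncio.run(coroutine)
    return result, time.perf_counter() - start

years = [2010, 2011, 2012]
sequential_result, sequential_time = timed(one_by_one(years))
concurrent_result, concurrent_time = timed(all_at_once(years))

print(f'one by one: {sequential_time:.2f} s, all at once: {concurrent_time:.2f} s')
```

With three simulated requests, the sequential version waits roughly three delays in a row, while the concurrent version waits roughly one.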

In [12]:
async def get_year_URL(client, url):
    
    '''Request the annual data for a system for a particular year.'''
    
    response = await client.get(url)
    return response

In [13]:
async def export_system_data(system):
    
    '''Create a list of future objects, one for every year of the particular system.
    Gather the data when the results are ready and write them into a .txt file with ',' separator.
    Note: The resulting data will have headers before every year's first record.
    '''
    
    async with httpx.AsyncClient() as client: 
        tasks = []
        available_years = systems_metadata[systems_metadata['system_id'] == system]['available_years']
        for year in available_years:
            url = f'https://developer.nrel.gov/api/pvdaq/v3/data_file?api_key=xPPvwr5Jn6RvoUod52vgbckHsa0pX382wVSJwU0o&system_id={system}&year={year}' 
            tasks.append(asyncio.ensure_future(get_year_URL(client, url)))  
        
        result_data = await asyncio.gather(*tasks)
        
        with open(f"annual_data/ASYNC_annual_data_system_{system}.txt", "w") as f:
            for result in result_data:
                f.write(result.text)
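
If the repeated headers mentioned in the docstring get in the way of later loading, one option is to drop them while joining the yearly texts. The sketch below assumes that every response starts with exactly one header line and that all headers are identical, which has not been verified against the API here. The helper name `merge_csv_texts` and the sample rows are made up for illustration.

```python
def merge_csv_texts(texts):
    '''Concatenate CSV texts, keeping only the header of the first one.

    Assumes every text starts with the same single header line.
    '''
    merged_lines = []
    for position, text in enumerate(texts):
        lines = text.splitlines()
        # Skip the header line of every text except the first.
        merged_lines.extend(lines if position == 0 else lines[1:])
    return '\n'.join(merged_lines) + '\n' if merged_lines else ''

# Made-up sample responses, not real PVDAQ data.
sample = ['time,power\n2010-01-01,1.0\n', 'time,power\n2011-01-01,2.0\n']
print(merge_csv_texts(sample))
```

In `export_system_data` this would mean writing `merge_csv_texts(result.text for result in result_data)` once instead of writing each `result.text` separately.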

## Time comparison between synchronous and asynchronous method

In [14]:
start = time.time()

export_request_result(2) 

time_synchronous = (time.time() - start)

In [15]:
start = time.time()

await export_system_data(2)

time_asynchronous = (time.time() - start)

In [16]:
print(f'The extraction of the annual data for system 2 synchronously took {time_synchronous} seconds.')
print(f'The extraction of the annual data for system 2 asynchronously took {time_asynchronous} seconds.')
print(f'Asynchronous program was {time_synchronous - time_asynchronous} seconds faster.')

The extraction of the annual data for system 2 synchronously took 81.7398943901062 seconds.
The extraction of the annual data for system 2 asynchronously took 26.65550470352173 seconds.
Asynchronous program was 55.08438968658447 seconds faster.
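
The absolute difference depends on how many years a system has, so the ratio of the two timings may be easier to compare across systems. The snippet below only reuses the two values printed above. Since they come from a single run, the ratio should be read as indicative rather than as a stable benchmark.

```python
# Timings copied from the output above (single run for system 2).
time_synchronous = 81.7398943901062
time_asynchronous = 26.65550470352173

# Ratio of the two run times: how many times faster the asynchronous run was.
speedup = time_synchronous / time_asynchronous
print(f'The asynchronous run was about {speedup:.1f} times faster.')
```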
