# Phase 3: Feature data preprocessing
## Overview
In this section I acquire and clean the data for the independent variables of each system. I already have latitude, longitude, elevation_m, azimuth, and tilt. I still need solar irradiance, temperature, cloud cover, wind speed, humidity, and precipitation, which I'm getting from the NASA POWER API.

## Importing needed libraries

In [14]:
import requests
import pandas as pd
from io import StringIO
import os
import time

sites_df = pd.read_csv('../../processed_data/useable_systems_metadata.csv')
display(sites_df.head())

Unnamed: 0,system_id,latitude,longitude,elevation_m,azimuth,tilt,dc_capacity_kW,mean_power_generation_kW,performance_ratio
0,10000,44.914573,-93.162525,288.79187,180.0,33.0,5.85,0.940219,0.160721
1,10001,39.483937,-76.301594,51.222542,180.0,20.0,3.36,0.920926,0.274085
2,10003,40.346434,-76.423645,145.150772,180.0,35.0,11.04,3.251637,0.294532
3,10005,33.198982,-97.150581,207.639435,180.0,19.0,15.665,3.438072,0.219475
4,10010,45.136089,-88.010933,202.512192,180.0,35.0,12.192,3.844456,0.315326


## Initial data inspection
First I'm going to look at how NASA POWER stores the data I need for each system.

In [5]:
# These are the variables I want data for: solar irradiance, air temp, cloud cover, wind speed, precipitation, humidity.
params = 'ALLSKY_SFC_SW_DWN,T2M,CLOUD_AMT,WS10M,PRECTOTCORR,RH2M'

# Sets the url for the request
base_url = "https://power.larc.nasa.gov/api/temporal/daily/point?parameters=" + params + "&community=RE"
system_url = base_url + "&longitude=-93.162525&latitude=44.914573&start=20190101&end=20191231&format=CSV"
response = requests.get(system_url)
print(response.text[:1000])

-BEGIN HEADER-
NASA/POWER Source Native Resolution Daily Data 
Dates (month/day/year): 01/01/2019 through 12/31/2019 in LST
Location: latitude  44.9146   longitude -93.1625 
elevation from MERRA-2: Average for 0.5 x 0.625 degree lat/lon region = 273.48 meters
The value for missing source data that cannot be computed or is outside of the sources availability range: -999 
parameter(s): 
ALLSKY_SFC_SW_DWN     CERES SYN1deg All Sky Surface Shortwave Downward Irradiance (kW-hr/m^2/day) 
T2M                   MERRA-2 Temperature at 2 Meters (C) 
CLOUD_AMT             CERES SYN1deg Cloud Amount (%) 
WS10M                 MERRA-2 Wind Speed at 10 Meters (m/s) 
PRECTOTCORR           MERRA-2 Precipitation Corrected (mm/day) 
RH2M                  MERRA-2 Relative Humidity at 2 Meters (%) 
-END HEADER-
YEAR,MO,DY,ALLSKY_SFC_SW_DWN,T2M,CLOUD_AMT,WS10M,PRECTOTCORR,RH2M
2019,1,1,1.998,-21.69,43.47,3.23,0.0,89.49
2019,1,2,1.3949,-13.4,64.74,4.29,0.02,94.78
2019,1,3,1.6867,-5.66,42.8,4.6
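
The request URL in the cell above is assembled by string concatenation. As a rough sketch, the same URL could instead be built from a dictionary of query parameters with the standard library's `urlencode`. The names `build_power_url`, `POWER_DAILY_ENDPOINT`, and `PARAMS` are hypothetical and introduced here only for illustration; the endpoint and parameter values are the ones used in the cell above.

```python
from urllib.parse import urlencode

# Endpoint and parameter list copied from the request cell above
POWER_DAILY_ENDPOINT = "https://power.larc.nasa.gov/api/temporal/daily/point"
PARAMS = "ALLSKY_SFC_SW_DWN,T2M,CLOUD_AMT,WS10M,PRECTOTCORR,RH2M"

def build_power_url(lat, lon, year, parameters=PARAMS):
    """Builds a daily point request URL covering one calendar year (hypothetical helper)."""
    query = {
        "parameters": parameters,
        "community": "RE",
        "longitude": lon,
        "latitude": lat,
        "start": f"{year}0101",
        "end": f"{year}1231",
        "format": "CSV",
    }
    # safe="," keeps the commas in the parameter list unescaped,
    # so the result has the same form as the concatenated URL
    return POWER_DAILY_ENDPOINT + "?" + urlencode(query, safe=",")

url = build_power_url(44.914573, -93.162525, 2019)
print(url)
```

The resulting string can be passed to `requests.get` exactly like `system_url`. Keeping the query in a dictionary makes it harder to drop an `&` or misplace a value when the coordinates and year change per system.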


## Extracting data for each system
Next I'm going to download the weather data for every year in which I have generation data for a system, and store it.

In [17]:
# The folder each file will be stored in
base_file_path = '../../raw_data/system_weather_data/'
base_folder_path_generation_data = '../../raw_data/system_generation_data/'

print("Started")
for index, row in sites_df.iterrows():
    sys_id = str(int(row['system_id']))
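    # Reads the year covered by each of the system's five generation files
    years = []
    for i in range(5):
        csv = pd.read_csv(base_folder_path_generation_data + sys_id + '/' + str(i+1) + '.csv')
        # The first four characters of this cell hold the year
        years.append(csv.iloc[1,0][0:4])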
        
    for year in years:    
        # Gets needed information
        lon = str(row['longitude'])
        lat = str(row['latitude'])

        # Gets the data for that system's coordinates
        system_url = base_url + "&longitude=" + lon + "&latitude=" + lat + "&start=" + year + "0101&end=" + year + "1231&format=CSV"
        response = requests.get(system_url)

        system_folder_path = base_file_path + sys_id
        # Creates a folder at that path if it doesn't exist
        if not os.path.exists(system_folder_path):
            os.makedirs(system_folder_path)
    
        # Saves that data, skipping the 14-line header block seen in the initial inspection
        df = pd.read_csv(StringIO(response.text), skiprows=14)
        df.to_csv(system_folder_path + '/' + year + '.csv', index=False)
    
        # Prevents spamming the server
        time.sleep(0.1)
    # Reports progress
    if index % 100 == 0:
        print(index)
        
print("Ended")

Started
0
100
200
300
400
500
600
700
800
Ended
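
The download loop relies on `skiprows=14`, which matches the 14-line header seen in the initial inspection for this set of six parameters. Since the header lists one line per parameter, that count would presumably change if the parameter list changed. A more defensive variant, sketched here under the assumption that every valid response contains the `-END HEADER-` line, locates that marker instead and treats the documented missing value of -999 as missing. The helper names `split_power_response` and `parse_rows` are hypothetical, the sample header is shortened, and the third data row is an illustrative placeholder rather than real data.

```python
import csv
from io import StringIO

# Shortened sample in the layout of the response inspected earlier;
# the last row is made up to show how -999 is handled
SAMPLE = """-BEGIN HEADER-
NASA/POWER Source Native Resolution Daily Data
parameter(s):
T2M                   MERRA-2 Temperature at 2 Meters (C)
-END HEADER-
YEAR,MO,DY,T2M
2019,1,1,-21.69
2019,1,2,-13.4
2019,1,3,-999
"""

def split_power_response(text, end_marker="-END HEADER-"):
    """Splits a response into (header lines, CSV body) at the end-of-header marker."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == end_marker:
            return lines[: i + 1], "\n".join(lines[i + 1:])
    # No marker usually means the request did not return data
    raise ValueError("end-of-header marker not found in response")

def parse_rows(body, missing=-999.0):
    """Parses the CSV body into dicts of floats, mapping the missing value to None."""
    rows = []
    for record in csv.DictReader(StringIO(body)):
        rows.append({
            key: (None if float(value) == missing else float(value))
            for key, value in record.items()
        })
    return rows

header, body = split_power_response(SAMPLE)
rows = parse_rows(body)
print(len(header), rows[0]["T2M"], rows[2]["T2M"])
```

In the notebook itself, the body could likely be handed to pandas instead, for example `pd.read_csv(StringIO(body), na_values=[-999])`. Raising an error when the marker is absent also keeps a failed request from being saved as if it were weather data.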
