# Phase 3: Feature data preprocessing
## Overview
In this section I will be acquiring and cleaning the data for the independent variables for each system. I already have latitude, longitude, elevation_m, azimuth, and tilt. I still need data on solar irradiance, temperature, cloud cover, wind speed, humidity, and precipitation, which I'm getting from the NASA POWER API.

## Importing needed libraries

In [26]:
import requests
import pandas as pd
from io import StringIO
import time

sites_df = pd.read_csv('../processed_data/useable_systems_metadata.csv')
display(sites_df.head())

Unnamed: 0,system_id,latitude,longitude,elevation_m,azimuth,tilt,dc_capacity_kW,mean_power_generation_kW,performance_ratio
0,10000,44.914573,-93.162525,288.79187,180.0,33.0,5.85,0.940219,0.160721
1,10001,39.483937,-76.301594,51.222542,180.0,20.0,3.36,0.920926,0.274085
2,10003,40.346434,-76.423645,145.150772,180.0,35.0,11.04,3.251637,0.294532
3,10005,33.198982,-97.150581,207.639435,180.0,19.0,15.665,3.438072,0.219475
4,10010,45.136089,-88.010933,202.512192,180.0,35.0,12.192,3.844456,0.315326


## Initial data inspection
First I'm going to look at how the data I need for each system is returned by the NASA POWER API.

In [48]:
# These are the variables I want data for: solar irradiance, air temp, cloud cover, wind speed, precipitation, humidity.
params = 'ALLSKY_SFC_SW_DWN,T2M,CLOUD_AMT,WS10M,PRECTOTCORR,RH2M'

# Coordinates of one system to inspect (here the last row of sites_df; any row works)
lon = str(sites_df.iloc[-1]['longitude'])
lat = str(sites_df.iloc[-1]['latitude'])

# Sets the url for the request
base_url = "https://power.larc.nasa.gov/api/temporal/daily/point?parameters=" + params + "&community=RE"
system_url = base_url + "&longitude=" + lon + "&latitude=" + lat + "&start=20190101&end=20191231&format=CSV"
response = requests.get(system_url)
print(response.text[:1000])

-BEGIN HEADER-
NASA/POWER Source Native Resolution Daily Data 
Dates (month/day/year): 01/01/2019 through 12/31/2019 in LST
Location: latitude  38.9519   longitude -76.9468 
elevation from MERRA-2: Average for 0.5 x 0.625 degree lat/lon region = 70.81 meters
The value for missing source data that cannot be computed or is outside of the sources availability range: -999 
parameter(s): 
ALLSKY_SFC_SW_DWN     CERES SYN1deg All Sky Surface Shortwave Downward Irradiance (kW-hr/m^2/day) 
T2M                   MERRA-2 Temperature at 2 Meters (C) 
CLOUD_AMT             CERES SYN1deg Cloud Amount (%) 
WS10M                 MERRA-2 Wind Speed at 10 Meters (m/s) 
PRECTOTCORR           MERRA-2 Precipitation Corrected (mm/day) 
RH2M                  MERRA-2 Relative Humidity at 2 Meters (%) 
-END HEADER-
YEAR,MO,DY,ALLSKY_SFC_SW_DWN,T2M,CLOUD_AMT,WS10M,PRECTOTCORR,RH2M
2019,1,1,1.205,10.05,99.73,6.31,0.5,87.52
2019,1,2,1.0651,4.4,97.54,2.68,0.02,89.49
2019,1,3,1.2348,4.51,96.1,3.21,0.2
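The response above wraps its metadata between `-BEGIN HEADER-` and `-END HEADER-` lines, with the CSV table starting right after. As a minimal sketch (the helper name `split_power_csv` is my own, not part of any library), the table could be located by searching for the end marker rather than by hard-coding how many lines to skip:

```python
def split_power_csv(text, end_marker="-END HEADER-"):
    """Split a NASA POWER CSV response into (header_lines, csv_body).

    Hypothetical helper: assumes the header ends with a line equal to
    end_marker, as in the sample response shown above.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == end_marker:
            # Everything up to and including the marker is header text
            return lines[: i + 1], "\n".join(lines[i + 1:])
    # No marker found: treat the whole response as CSV
    return [], text


# Shortened stand-in for a real response
sample = "-BEGIN HEADER-\nparameter(s):\n-END HEADER-\nYEAR,MO,DY,T2M\n2019,1,1,10.05\n"
header_lines, csv_body = split_power_csv(sample)
print(len(header_lines))
print(csv_body.splitlines()[0])
```

The returned `csv_body` can then be passed to `pd.read_csv(StringIO(csv_body))`, which should keep working if the number of requested parameters (and so the header length) changes.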


## Extracting data for each system
Next I'm going to download this data for each system and store it.

In [29]:
# The folder each file will be stored in
base_file_path = '../raw_data/system_weather_data/'

print("Started")
for index, row in sites_df.iterrows():
    # Gets needed information
    lon = str(row['longitude'])
    lat = str(row['latitude'])
    sys_id = str(int(row['system_id']))

    # Gets the data for that system's coordinates
    system_url = base_url + "&longitude=" + lon + "&latitude=" + lat + "&start=20190101&end=20191231&format=CSV"
    response = requests.get(system_url)
    # Stops on a failed request so an error page is not saved as data
    response.raise_for_status()

    # Saves that data (skiprows=14 skips the 14-line header block shown above)
    df = pd.read_csv(StringIO(response.text), skiprows=14)
    df.to_csv(base_file_path + sys_id + '.csv', index=False)

    # Prevents spamming the server
    time.sleep(1)
    # Reports progress
    if index % 100 == 0:
        print(index)
        
print("Ended")

Started
0
100
200
300
400
500
600
700
800
Ended
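The loop above makes several hundred requests and builds each URL by string concatenation. The sketch below shows one way this could be made a bit more robust: the query string is assembled from a dict, and a small wrapper retries a failed call. The names `build_power_url` and `fetch_with_retries` are hypothetical helpers of my own, and the retry count and delay are arbitrary assumptions.

```python
import time
from urllib.parse import urlencode

POWER_ENDPOINT = "https://power.larc.nasa.gov/api/temporal/daily/point"


def build_power_url(lat, lon, parameters, start="20190101", end="20191231"):
    """Build the request URL from a dict of query parameters (hypothetical helper)."""
    query = {
        "parameters": ",".join(parameters),
        "community": "RE",
        "longitude": lon,
        "latitude": lat,
        "start": start,
        "end": end,
        "format": "CSV",
    }
    # safe="," keeps the comma-separated parameter list readable in the URL
    return POWER_ENDPOINT + "?" + urlencode(query, safe=",")


def fetch_with_retries(fetch, url, attempts=3, delay_s=1.0):
    """Call fetch(url); if it raises, wait and try again, up to `attempts` calls."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            time.sleep(delay_s * attempt)  # wait a little longer each time


url = build_power_url(44.914573, -93.162525, ["T2M", "RH2M"])
print(url)
```

In the notebook, `fetch` could be something like `lambda u: requests.get(u, timeout=30)`; passing the fetch function in keeps the retry logic independent of how the request is actually made.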


## Final data preprocessing
Now I'm going to use all the data I've downloaded: for each system I take the mean of each variable over 2019 and add it to the system's metadata. I will also drop the capacity and generation columns, as I won't need them for my model.

In [47]:
# Copies the metadata so sites_df is left unchanged
final_systems_metadata_df = sites_df.copy()
# The column names I'll be using
variables = ['solar irradiance', 'temperature', 'cloud cover', 'wind speed', 'humidity', 'precipitation']
# Names of the variables in the data I downloaded, in the same order as the list above
variable_names = ['ALLSKY_SFC_SW_DWN', 'T2M', 'CLOUD_AMT', 'WS10M', 'RH2M', 'PRECTOTCORR']

# Creates all of the columns we need
for variable in variables:
    final_systems_metadata_df[variable] = None

for index, row in final_systems_metadata_df.iterrows():
    sys_id = str(int(row['system_id']))
    # Reads the data for that system
    df = pd.read_csv(base_file_path + sys_id + '.csv')

    # Finds the mean for each variable and stores it in the df
    for variable, variable_name in zip(variables, variable_names):
        final_systems_metadata_df.loc[index, variable] = df[variable_name].mean()

final_df = final_systems_metadata_df.drop(columns=['dc_capacity_kW', 'mean_power_generation_kW'])
display(final_df.head())
final_df.to_csv('../processed_data/final_systems_metadata.csv', index=False)


Unnamed: 0,system_id,latitude,longitude,elevation_m,azimuth,tilt,performance_ratio,solar irradiance,temperature,cloud cover,wind speed,humidity,precipitation
0,10000,44.914573,-93.162525,288.79187,180.0,33.0,0.160721,3.743916,5.400575,63.48326,3.976438,83.503699,3.027014
1,10001,39.483937,-76.301594,51.222542,180.0,20.0,0.274085,4.093048,12.999041,63.736466,3.496603,77.229479,2.970356
2,10003,40.346434,-76.423645,145.150772,180.0,35.0,0.294532,3.873484,10.765973,66.073288,2.225288,80.013945,3.382959
3,10005,33.198982,-97.150581,207.639435,180.0,19.0,0.219475,4.70751,18.458411,50.966411,4.469096,71.625863,2.365068
4,10010,45.136089,-88.010933,202.512192,180.0,35.0,0.315326,3.561523,5.190986,66.488795,4.549726,85.779178,3.327918
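The header of the sample response earlier states that missing source data is reported as -999. I have not checked whether any of the 2019 values are actually missing, but if some were, a plain `.mean()` would be dragged far down by that sentinel. A minimal sketch of one way to guard against this, using the `na_values` argument of `pd.read_csv` on made-up toy data (not real NASA POWER output):

```python
from io import StringIO

import pandas as pd

# Toy data for illustration only; the -999 entries are invented
sample = """YEAR,MO,DY,T2M,RH2M
2019,1,1,10.05,87.52
2019,1,2,-999,89.49
2019,1,3,4.51,-999
"""

# Without na_values, -999 is treated as a real measurement
naive = pd.read_csv(StringIO(sample))
# With na_values, -999 becomes NaN, which .mean() skips by default
clean = pd.read_csv(StringIO(sample), na_values=[-999])

print(naive["T2M"].mean())
print(clean["T2M"].mean())
```

The same `na_values=[-999]` argument could be added where the per-system files are read back in, so that each yearly mean only uses days with valid data.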
