# Phase 1: Feature extraction
## Overview
Here I look through the PVDAQ solar panel metadata CSV file and decide which solar panel systems to use.

## Initial data inspection

In [1]:
import pandas as pd
metadata_df = pd.read_csv('../raw_data/systems_metadata.csv')
display(metadata_df.head())

Unnamed: 0,system_id,system_public_name,site_location,timezone_or_utc_offset,latitude,longitude,elevation_m,dc_capacity_kW,kg_climate,pvcz_composite,...,number_records,dataset_size_mb,available_sensor_channels,qa_status,qa_issue,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30
0,2,Residential 1a,"Lakewood, CO",America/Denver,39.7214,-105.0972,1675.0,2.912,Dfb,12,...,13685898.0,313.25,7,fail,less than 1.0 years data,,,,,
1,3,Residential 1b,"Lakewood, CO",America/Denver,39.7214,-105.0972,1675.0,2.72,Dfb,12,...,12668178.0,289.95,7,fail,,,,,,
2,4,NREL x-Si -1,"Golden, CO",7,39.7406,-105.1774,1795.3,1.0,BSk,12,...,113978017.0,2608.75,15,pass,"Filtered time series less than 1.0 years data,...",,,,,
3,10,NREL CIS -1,"Golden, CO",7,39.7404,-105.1774,1792.8,1.12,BSk,12,...,113103574.0,2588.74,14,pass,Filtered time series less than 1.0 years data,,,,,
4,33,Silicor Materials,"Golden, CO",7,39.7404,-105.1772,1794.0,2.4,BSk,12,...,113673602.0,2601.78,15,pass,"Percent clipping exceeded threshold of 10%, Fi...",,,,,


## Filtering what systems I need
First, I need solar panels with a large amount of high-quality data, so the data will be filtered for systems that have more than 5 years of data and pass the quality assurance tests. Next, I will look at solar panels with my desired configuration: single-axis tracking and ground-mounted panels.

From the counts below, almost all systems are fixed and the type is mostly unknown. This is a significant problem for my originally planned configuration, so instead I will switch to fixed systems and use all types of solar panels. The main differences between roof-mounted and ground-mounted solar panels are the tilt of the panels and the scale of the farm. Using the tilt as one of my features and using specific yield should allow me to create the model without having to know the type.
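Specific yield is the energy produced divided by the installed DC capacity (kWh per kW), which is what makes systems of different sizes comparable. A minimal sketch of the calculation follows. The capacities are taken from the metadata preview above, but the `annual_energy_kwh` column and its values are invented for illustration and do not come from the dataset.

```python
import pandas as pd


def specific_yield(energy_kwh, dc_capacity_kw):
    """Energy produced per unit of installed DC capacity (kWh per kW)."""
    return energy_kwh / dc_capacity_kw


# Capacities match systems 4 and 10 in the metadata preview;
# the annual energy totals are made-up values, not real measurements.
example = pd.DataFrame({
    'system_id': [4, 10],
    'dc_capacity_kW': [1.0, 1.12],
    'annual_energy_kwh': [1500.0, 1680.0],
})

example['specific_yield'] = specific_yield(
    example['annual_energy_kwh'], example['dc_capacity_kW']
)
print(example)
```

Both toy systems end up with roughly the same specific yield even though their capacities differ, which is why the system scale does not need to be a separate feature.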

In [2]:
value_type_counts = metadata_df['type'].value_counts()
value_tracking_counts = metadata_df['tracking'].value_counts()

print('Type count:')
for item, count in value_type_counts.items():
    print(item, count)

print('\nTracking count:')
for item, count in value_tracking_counts.items():
    print(item, count)

Type count:
unknown 1498
roof 99
ground 13
single-axis tracker 5
Unknown 3
carport 2
dual-axis tracker 1

Tracking count:
fixed 1613
tracking 8


In [3]:
filtered_df = metadata_df[
    (metadata_df['years'] > 5) &
    (metadata_df['tracking'] == 'fixed') &
    (metadata_df['qa_status'] == 'pass')
]
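To see how much each condition narrows the selection, the filter can be applied one condition at a time. The sketch below does this on a small made-up dataframe. The column names match the metadata file, but the rows are invented for illustration.

```python
import pandas as pd

# Made-up rows; only the column names come from the metadata file.
toy = pd.DataFrame({
    'system_id': [1, 2, 3, 4],
    'years': [6.2, 3.0, 8.5, 7.1],
    'tracking': ['fixed', 'fixed', 'tracking', 'fixed'],
    'qa_status': ['pass', 'pass', 'pass', 'fail'],
})

conditions = {
    'years > 5': toy['years'] > 5,
    'tracking == fixed': toy['tracking'] == 'fixed',
    'qa_status == pass': toy['qa_status'] == 'pass',
}

# Combine the conditions cumulatively and record how many systems survive each step.
mask = pd.Series(True, index=toy.index)
remaining = {}
for label, condition in conditions.items():
    mask = mask & condition
    remaining[label] = int(mask.sum())
    print(f'{label}: {remaining[label]} systems remain')

toy_filtered = toy[mask]
```

Swapping `toy` for `metadata_df` would show which of the three conditions removes the most systems.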

## Cleaning the metadata
Some of the data I need for the model's features is available in the metadata file, so I will save it to a CSV file.

I will also check the feature data for missing values and remove any rows that contain them from the dataframe.

In [4]:
needed_columns = [
    'system_id', 'latitude', 'longitude', 'elevation_m', 'azimuth', 'tilt'
]
final_df = filtered_df[needed_columns].copy()

# Drop any row that contains a NaN in one of the feature columns.
final_df = final_df.dropna()

display(final_df.head())
final_df.to_csv('../processed_data/system_data.csv', index=False)

Unnamed: 0,system_id,latitude,longitude,elevation_m,azimuth,tilt
2,4,39.7406,-105.1774,1795.3,180.0,40.0
3,10,39.7404,-105.1774,1792.8,180.0,40.0
4,33,39.7404,-105.1772,1794.0,180.0,40.0
5,34,36.1952,-115.1582,620.0,180.0,11.2
8,50,39.742,-105.1727,1994.7,158.0,45.0
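As a sanity check on the cleaning and saving step, the round trip can be tried on a toy dataframe: drop rows with missing values using `DataFrame.dropna`, write the result to CSV, reload it, and confirm that no missing values remain. This is only a sketch. It writes to an in-memory buffer instead of `../processed_data/system_data.csv`, and the NaN is inserted deliberately for illustration.

```python
import io

import numpy as np
import pandas as pd

# Toy feature table; the NaN in 'tilt' is inserted on purpose.
toy = pd.DataFrame({
    'system_id': [4, 10, 33],
    'latitude': [39.7406, 39.7404, 39.7404],
    'tilt': [40.0, np.nan, 40.0],
})

# Remove every row that has at least one missing value.
clean = toy.dropna()

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
clean.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The reloaded data should contain no missing values.
assert not reloaded.isna().any().any()
print(reloaded)
```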
