# Context
- The owner of a residential solar panel installation would like a continuous forecast of his panels' yield for the coming hours in order to optimize his own consumption: when a high yield is expected, he can, for example, decide to run the washing machine.
- He has hourly meter readings going back about one year (solar.csv).
- In addition, weather observations (weather.csv) and the times of sunrise and sunset (sunrise-sunset.xlsx) are available for the same period.

# Assignment
- Build a regression model to predict the hourly yield.
- Choose an optimal regression model by trying out different models and comparing them according to best practices.
- Use the mean absolute error on an hourly basis as the evaluation metric.




In [None]:
import sys
import sklearn
import numpy as np
import pandas as pd
import os
import matplotlib as mpl
import matplotlib.pyplot as plt
from datetime import datetime

## Preparation and cleanup of solar data

In [None]:
def prepare_solar():
    solar = pd.read_csv('datasets/solar.csv')  
    print(solar.head(10))
    print('-------------')
    print(solar.tail(10))
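    # Parse the first 19 characters (YYYY-MM-DD HH:MM:SS) as naive local time
    solar['datetime'] = pd.to_datetime(solar['timestamp'].str[0:19], format='%Y-%m-%d %H:%M:%S')
    # Localize to Europe/Berlin; ambiguous DST times become NaT and are dropped below
    solar['datetime_tz'] = solar['datetime'].dt.tz_localize('Europe/Berlin', ambiguous='NaT')
    solar['datetime_utc'] = solar['datetime_tz'].dt.tz_convert('UTC')
    solar['date'] = solar['datetime_utc'].dt.date
    solar['hour'] = solar['datetime_utc'].dt.hour
    # Chain the cleanup instead of using inplace=True on a column slice,
    # which triggers pandas' SettingWithCopyWarning
    solar = solar[['date','hour','kwh']].dropna().drop_duplicates()
    # TODO: compute the time difference between rows and drop gaps > 1 hour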
    return solar

solar = prepare_solar()
solar.info()
solar.head()

In [None]:
solar.tail()

## Preparation and cleanup of weather data

In [None]:
def prepare_weather():
    weather = pd.read_csv("datasets/weather.csv")
    print(weather.head(10))
    weather = weather[['timestamp','temp','pressure','cloudiness','humidity_relative']]
    weather = weather.groupby(by=['timestamp']).mean().reset_index()
    weather['datetime'] = pd.to_datetime(weather['timestamp'],format='%Y-%m-%dT%H:%M:%S')
    weather['date'] = weather['datetime'].dt.date
    weather['hour'] = weather['datetime'].dt.hour   
    # Select the final columns and de-duplicate without in-place edits on a slice
    weather = weather[['date','hour','temp','pressure','cloudiness','humidity_relative']].drop_duplicates()
    return weather

weather = prepare_weather()
weather.info()
weather.head(30)


## Preparation and cleanup of sunrise/sunset data

In [None]:
!pip install openpyxl

In [None]:
def prepare_sunrise_sunset():
    sunrise_sunset = pd.read_excel("datasets/sunrise-sunset.xlsx") 
    print(sunrise_sunset.head(10))
    sunrise_sunset.rename(columns={'datum':'date'}, inplace=True)
    sunrise_sunset['sunrise'] = [pd.Timestamp.combine(d,t) for d,t in zip(sunrise_sunset['date'],sunrise_sunset['Opkomst'])]
    sunrise_sunset['noon'] = [pd.Timestamp.combine(d,t) for d,t in zip(sunrise_sunset['date'],sunrise_sunset['Op ware middag'])]
    sunrise_sunset['sunset'] = [pd.Timestamp.combine(d,t) for d,t in zip(sunrise_sunset['date'],sunrise_sunset['Ondergang'])]
    sunrise_sunset['sunrise'] = sunrise_sunset['sunrise'].dt.tz_localize('Europe/Berlin',ambiguous='NaT')
    sunrise_sunset['sunrise'] = sunrise_sunset['sunrise'].dt.tz_convert('UTC')
    sunrise_sunset['sunset'] = sunrise_sunset['sunset'].dt.tz_localize('Europe/Berlin',ambiguous='NaT')
    sunrise_sunset['sunset'] = sunrise_sunset['sunset'].dt.tz_convert('UTC')
    sunrise_sunset['date'] = sunrise_sunset['date'].dt.date
    sunrise_sunset = sunrise_sunset[['date','sunrise','sunset']]
    return sunrise_sunset

sunrise_sunset = prepare_sunrise_sunset()
sunrise_sunset.info()
sunrise_sunset.head()

Combine solar and weather data in a single dataframe. 
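
One possible way to combine the two frames is a join on the shared `date` and `hour` columns. The sketch below uses tiny made-up stand-ins (`solar_demo`, `weather_demo`, both hypothetical) instead of the real data, and assumes both sources express `date` and `hour` in the same time zone.

```python
import datetime as dt
import pandas as pd

# Made-up stand-ins for the `solar` and `weather` frames prepared above.
solar_demo = pd.DataFrame({
    'date': [dt.date(2021, 6, 1)] * 3,
    'hour': [10, 11, 12],
    'kwh': [1.2, 1.9, 2.4],
})
weather_demo = pd.DataFrame({
    'date': [dt.date(2021, 6, 1)] * 3,
    'hour': [11, 12, 13],
    'temp': [18.0, 19.5, 20.1],
    'cloudiness': [40.0, 20.0, 10.0],
})

# An inner join keeps only the hours that are present in both sources.
combined = solar_demo.merge(weather_demo, on=['date', 'hour'], how='inner')
print(combined)
```

With an inner join, hours that lack either a meter reading or a weather observation drop out, which is usually what a supervised model needs.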

Now also combine this dataset with sunrise_sunset. 
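
Since sunrise and sunset are given once per day, a many-to-one join on `date` is one way to attach them to every hourly row. This is a minimal sketch on made-up data; `hourly_demo` and `sun_demo` are hypothetical stand-ins for the merged hourly frame and for `sunrise_sunset`.

```python
import datetime as dt
import pandas as pd

# Made-up hourly rows spanning two days.
hourly_demo = pd.DataFrame({
    'date': [dt.date(2021, 6, 1), dt.date(2021, 6, 1), dt.date(2021, 6, 2)],
    'hour': [10, 11, 10],
    'kwh': [1.2, 1.9, 1.1],
})
# One row per day with tz-aware (UTC) sunrise and sunset timestamps.
sun_demo = pd.DataFrame({
    'date': [dt.date(2021, 6, 1), dt.date(2021, 6, 2)],
    'sunrise': pd.to_datetime(['2021-06-01 03:35', '2021-06-02 03:34'], utc=True),
    'sunset': pd.to_datetime(['2021-06-01 19:45', '2021-06-02 19:46'], utc=True),
})

# validate='many_to_one' raises if a date occurs more than once in sun_demo.
full = hourly_demo.merge(sun_demo, on='date', how='left', validate='many_to_one')
print(full)
```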

## Feature Engineering
Keep only the following features:
- dayinyear: number of the day in the year (1/1 = 1, 31/12 = 365)
- sunrise_delta: hours after sunrise
- sunset_delta: hours before sunset
- temp
- pressure
- cloudiness
- humidity
- production (kW): yield of the current hour
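
The three time-based features in the list above could be derived roughly as follows. This is a sketch on made-up rows, assuming `date` and `hour` are in UTC and `sunrise`/`sunset` are tz-aware UTC timestamps; the weather columns and the production column are left out here.

```python
import pandas as pd

# Made-up rows shaped like the merged frame.
df = pd.DataFrame({
    'date': [pd.Timestamp('2021-06-01').date()] * 2,
    'hour': [4, 12],
    'sunrise': pd.to_datetime(['2021-06-01 03:30'] * 2, utc=True),
    'sunset': pd.to_datetime(['2021-06-01 19:30'] * 2, utc=True),
})

# Rebuild a UTC timestamp for the start of each hour.
ts = pd.to_datetime(df['date']).dt.tz_localize('UTC') + pd.to_timedelta(df['hour'], unit='h')

df['dayinyear'] = ts.dt.dayofyear
# Hours elapsed since sunrise (negative before sunrise).
df['sunrise_delta'] = (ts - df['sunrise']).dt.total_seconds() / 3600
# Hours remaining until sunset (negative after sunset).
df['sunset_delta'] = (df['sunset'] - ts).dt.total_seconds() / 3600
print(df[['dayinyear', 'sunrise_delta', 'sunset_delta']])
```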

Create a histogram for all numerical features
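
`DataFrame.hist` draws one histogram per numerical column in a single call. The sketch below uses randomly generated stand-in data (the `demo` frame and its distributions are made up for illustration only).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Random stand-in data; the real frame would come from the steps above.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'temp': rng.normal(12, 6, 500),
    'cloudiness': rng.uniform(0, 100, 500),
    'production': rng.exponential(0.5, 500),
})

# One subplot per numerical column.
axes = demo.hist(bins=30, figsize=(10, 6))
plt.tight_layout()
```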

Which column requires further attention? Explain the odd data and fix it. 

Store the current dataframe to a csv file so we can use it later. 

Read the data from the csv file
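
A CSV round trip with pandas might look like the sketch below. The file name `solar_features.csv` and the temporary directory are assumptions made for the example; in the notebook any writable path would do.

```python
import os
import tempfile
import pandas as pd

# Made-up frame standing in for the engineered feature set.
demo = pd.DataFrame({
    'dayinyear': [152, 152],
    'temp': [18.0, 19.5],
    'production': [0.4, 1.1],
})

csv_path = os.path.join(tempfile.gettempdir(), 'solar_features.csv')  # hypothetical file name
# index=False avoids writing the row index as an extra unnamed column.
demo.to_csv(csv_path, index=False)

restored = pd.read_csv(csv_path)
print(restored.dtypes)
```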

Split the dataset into a training set and a test set.
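
A minimal sketch of the split with scikit-learn's `train_test_split`, on randomly generated stand-in data. The 80/20 ratio and the `random_state` value are assumptions, not requirements of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in data with the target in a 'production' column.
rng = np.random.default_rng(42)
data = pd.DataFrame({
    'cloudiness': rng.uniform(0, 100, 200),
    'sunrise_delta': rng.uniform(0, 16, 200),
})
data['production'] = rng.uniform(0, 3, 200)

X = data.drop(columns='production')
y = data['production']

# Hold out 20% of the rows; a fixed random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```

Because consecutive hours are correlated, a chronological split (`shuffle=False`) may give a more conservative estimate than a random one; which of the two fits better depends on how the model will be used.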

Create a Random Forest model to predict the hourly production. 
- Create a pipeline with a StandardScaler and a random forest regressor
- Find the optimal parameter combination amongst
  - bootstrap: False, True
  - n_estimators: 50 - 200 with steps of 50
  - max_depth: 10 - 50 with steps of 10 
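
The pipeline and parameter grid described above could be set up along these lines. The data is a small random stand-in so the search finishes quickly; the step names `scaler` and `rf`, the 3-fold cross-validation and the `random_state` are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small random stand-in data set.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(60, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 0.1, size=60)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestRegressor(random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {
    'rf__bootstrap': [False, True],
    'rf__n_estimators': list(range(50, 201, 50)),  # 50, 100, 150, 200
    'rf__max_depth': list(range(10, 51, 10)),      # 10, 20, 30, 40, 50
}

# scikit-learn maximizes scores, so the MAE is exposed as a negative value.
search = GridSearchCV(pipeline, param_grid, scoring='neg_mean_absolute_error', cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print('Cross-validated MAE:', -search.best_score_)
```

Tree-based models are insensitive to feature scaling, so the scaler mainly keeps the pipeline interchangeable with other regressors that do need it.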

Determine the mean absolute error on the test set. Is this a useful model? 
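
To judge whether an MAE is useful, it helps to compare it against a trivial baseline. The sketch below does that on random stand-in data with a `DummyRegressor` that always predicts the training mean; the data and model settings are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Random stand-in data with a clear signal in the first feature.
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 1, size=(300, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 0.2, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)

# The MAE is expressed in the same unit as the target.
mae_model = mean_absolute_error(y_test, model.predict(X_test))
mae_baseline = mean_absolute_error(y_test, baseline.predict(X_test))
print(f'model MAE: {mae_model:.3f}, baseline MAE: {mae_baseline:.3f}')
```

A model is only worth using if its error is clearly below the baseline and small relative to the typical hourly production.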

Explain the concept of noise in this context


Store the model to a file. 
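
A fitted scikit-learn estimator can be persisted with `joblib`, which is installed alongside scikit-learn. The sketch below trains a small stand-in model on random data; the file name `solar_rf_model.joblib` and the temporary directory are assumptions.

```python
import os
import tempfile
import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small stand-in model trained on random data.
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 1, size=(50, 2))
y_demo = X_demo.sum(axis=1)
model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, y_demo)

model_path = os.path.join(tempfile.gettempdir(), 'solar_rf_model.joblib')  # hypothetical file name
joblib.dump(model, model_path)

# Loading returns an estimator that predicts exactly like the original.
restored = joblib.load(model_path)
print(restored.predict(X_demo[:3]))
```

Files written this way should be loaded with the same scikit-learn version that produced them, and only from trusted sources.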