# Libraries

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from matplotlib import pyplot as plt

# Load Data

In [2]:
contracts = pd.read_csv('datafiles_assignment_data_scientist/contracten_database_table.csv', sep = ';')
values_actual = pd.read_json('datafiles_assignment_data_scientist/actual_values.json')
pv_forecast = pd.read_json('datafiles_assignment_data_scientist/forecast_values.json')
weather_actual = pd.read_json('datafiles_assignment_data_scientist/weather_actual.json')
weather_forecast = pd.read_json('datafiles_assignment_data_scientist/weather_forecast.json')
prices = pd.read_json('datafiles_assignment_data_scientist/price_epex.json')

# Data Preprocessing
## Data cleaning
Remove outliers using the threshold defined during the EDA.

In [8]:
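# Remove outliers: keep rows whose p1_actual_kwh lies strictly within (-threshold, threshold).
# Rows with NaN in p1_actual_kwh are dropped as well, because comparisons with NaN are False.
threshold = 5 * 10e3  # 10e3 is 1e4, so the threshold is 50,000 kWh
consumption = values_actual[['contract_id', 'timestamp', 'p1_actual_kwh', 'pv_actual_kwh']].copy()
consumption = consumption[(consumption.p1_actual_kwh > -threshold) &
                          (consumption.p1_actual_kwh < threshold)].copy()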

In [3]:
def remove_outliers_zscore(data, threshold=3):
    """
    Removes outliers from a time series using the Z-score method.
    
    Parameters:
    data (np.ndarray): The time series data.
    threshold (float): The Z-score threshold for identifying outliers. Defaults to 3.
    
    Returns:
    np.ndarray: The time series data with outliers removed.
    """
    z_scores = np.abs((data - np.mean(data)) / np.std(data))
    mask = z_scores < threshold
    return data[mask]
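
As a quick sanity check of the Z-score approach, the sketch below applies the same logic to a small synthetic series (the values are made up for illustration and are not taken from the dataset). Note that a single extreme value also inflates the standard deviation, so on very short series the default threshold of 3 may flag nothing.

```python
import numpy as np

def remove_outliers_zscore(data, threshold=3):
    """Same logic as the notebook function: drop points with |z| >= threshold."""
    z_scores = np.abs((data - np.mean(data)) / np.std(data))
    return data[z_scores < threshold]

# Synthetic example: twenty "normal" readings and one extreme reading.
series = np.array([1.0] * 20 + [100.0])
cleaned = remove_outliers_zscore(series)

print(len(series), len(cleaned))
```

The function returns a shorter array, so it should only be used where dropping time steps is acceptable; for a regular time series, masking the values as NaN may be preferable.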

## Handling of missing values
Set NaN values of pv_actual_kwh to 0.

In [9]:
consumption['pv_actual_kwh'] = consumption['pv_actual_kwh'].fillna(0)

## Data split
### Split: solar panels from Zonneplan vs. not from Zonneplan

In [11]:
# Set of contract_id values with solar panels from Zonneplan (based on pv_actual_kwh being present)
pv_w_zp_contract = set(values_actual[values_actual['pv_actual_kwh'].notna()]['contract_id'])

# (Inferred) set of contract_id values with solar panels not from Zonneplan.
# We assume that any contract_id with p1_actual_kwh < 0 at some point has solar panels.
customer_w_pv = set(values_actual[values_actual['p1_actual_kwh'] < 0]['contract_id'])
pv_wo_zp_contract = customer_w_pv - pv_w_zp_contract
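
To make the inference above concrete, here is a minimal sketch on a made-up frame with three contracts (the values are illustrative only). Contract 1 has measured PV, contract 2 feeds energy back without measured PV, and contract 3 never goes negative.

```python
import numpy as np
import pandas as pd

# Toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "contract_id": [1, 1, 2, 2, 3, 3],
    "p1_actual_kwh": [0.4, -0.2, 0.5, -0.1, 0.3, 0.6],
    "pv_actual_kwh": [0.1, 0.5, np.nan, np.nan, np.nan, np.nan],
})

# Contracts with a PV measurement: panels known to Zonneplan.
with_zp = set(toy.loc[toy["pv_actual_kwh"].notna(), "contract_id"])

# Contracts with net feed-in at least once: assumed to have panels.
feeds_in = set(toy.loc[toy["p1_actual_kwh"] < 0, "contract_id"])

# Feed-in without a PV measurement: inferred panels not from Zonneplan.
without_zp = feeds_in - with_zp

print(with_zp, feeds_in, without_zp)
```

A household with panels that never produces more than it consumes would be missed by this rule, so the inferred set is a lower bound.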

In [16]:
# Data for customers with solar panels not installed by Zonneplan
consumption_wo_zp = consumption[consumption.contract_id.isin(pv_wo_zp_contract)]

# Data for all other customers: without solar panels, or with solar panels installed by Zonneplan
consumption_w_zp = consumption[~consumption.contract_id.isin(pv_wo_zp_contract)]

### Split: Training and Test
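
The notebook does not show how the split is done. One common option for time series is a chronological split, sketched below. The helper `split_by_time` and the cutoff date are hypothetical and only illustrate the idea; the real cutoff would depend on the period covered by the data.

```python
import pandas as pd

def split_by_time(df, cutoff, time_col="timestamp"):
    """Hypothetical helper: rows before `cutoff` go to train, the rest to test."""
    ts = pd.to_datetime(df[time_col])
    cutoff = pd.Timestamp(cutoff)
    return df[ts < cutoff], df[ts >= cutoff]

# Toy data: six daily readings (made-up values).
demo = pd.DataFrame({
    "timestamp": pd.date_range("2023-01-01", periods=6, freq="D"),
    "p1_actual_kwh": [1.0, 1.2, 0.9, 1.1, 1.3, 1.0],
})

train, test = split_by_time(demo, "2023-01-05")
print(len(train), len(test))
```

Splitting by time rather than at random avoids training on observations that lie after the test period.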

## Feature selection
Selected features:
* hour
* temperature
* price
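
A minimal sketch of how these three features could be assembled into one table. The column names `temperature`, `price` and `consumption_kwh`, and the helper `build_features`, are assumptions for illustration; the actual weather and price files may use different names and time resolutions.

```python
import pandas as pd

def build_features(consumption, weather, prices):
    """Hypothetical helper: join weather and price on timestamp, then add the hour."""
    features = consumption.merge(weather, on="timestamp", how="left")
    features = features.merge(prices, on="timestamp", how="left")
    features["hour"] = pd.to_datetime(features["timestamp"]).dt.hour
    return features[["timestamp", "hour", "temperature", "price", "consumption_kwh"]]

# Toy inputs with made-up values.
ts = pd.to_datetime(["2023-01-01 00:00", "2023-01-01 01:00"])
consumption_demo = pd.DataFrame({"timestamp": ts, "consumption_kwh": [0.5, 0.4]})
weather_demo = pd.DataFrame({"timestamp": ts, "temperature": [3.0, 2.5]})
prices_demo = pd.DataFrame({"timestamp": ts, "price": [80.0, 75.0]})

features = build_features(consumption_demo, weather_demo, prices_demo)
print(features)
```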

Consumption:
* for customers with solar panels installed by Zonneplan: p1_actual_kwh + pv_actual_kwh (or pv_forecast_kwh for test data)
* for customers with solar panels not installed by Zonneplan: take the averaged pv_actual_kwh/pv_forecast_kwh and add it to p1_actual_kwh
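
The two consumption rules above can be sketched as follows on toy data. The interpretation of "averaged" as the mean PV production per timestamp over households with measured PV is an assumption, as is the output column name `consumption_kwh`; the values are made up.

```python
import numpy as np
import pandas as pd

# Toy data: contracts 1 and 2 have measured PV, contract 3 is inferred to have PV.
toy = pd.DataFrame({
    "contract_id": [1, 2, 3, 1, 2, 3],
    "timestamp": pd.to_datetime(["2023-06-01 12:00"] * 3 + ["2023-06-01 13:00"] * 3),
    "p1_actual_kwh": [-0.5, -0.3, -0.2, 0.1, -0.1, -0.4],
    "pv_actual_kwh": [1.0, 0.6, np.nan, 0.8, 0.4, np.nan],
})
pv_wo_zp = {3}  # contracts with panels not installed by Zonneplan

# Mean measured PV per timestamp (NaN values are skipped by the mean).
avg_pv = toy.groupby("timestamp")["pv_actual_kwh"].transform("mean")

# Use the household's own PV where measured, the average PV for inferred-PV households.
is_wo_zp = toy["contract_id"].isin(pv_wo_zp)
pv_used = toy["pv_actual_kwh"].fillna(0).where(~is_wo_zp, avg_pv)

toy["consumption_kwh"] = toy["p1_actual_kwh"] + pv_used
print(toy)
```

Using a plain average ignores differences in panel size and orientation, so the reconstructed consumption for these households is only a rough estimate.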

# Training Forecast
**Methods:**
* Naive method: average the previous consumption of all households together, taking features such as time into account
* Considering additional PV input: predict the solar production for households with solar panels not installed by Zonneplan
* Fine-tuning per customer: create one model per customer
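
As an illustration of the naive method, the sketch below computes an average consumption profile per hour of day across all households. The helper `fit_hourly_profile` and the column name `consumption_kwh` are hypothetical, and the data is made up.

```python
import pandas as pd

def fit_hourly_profile(df, value_col="consumption_kwh", time_col="timestamp"):
    """Hypothetical helper: mean consumption per hour of day over all rows."""
    hours = pd.to_datetime(df[time_col]).dt.hour.rename("hour")
    return df.groupby(hours)[value_col].mean()

# Toy data: two days, two hours each.
history = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 01:00",
        "2023-01-02 00:00", "2023-01-02 01:00",
    ]),
    "consumption_kwh": [1.0, 2.0, 3.0, 4.0],
})

profile = fit_hourly_profile(history)
print(profile)
```

The same grouping could be extended with further keys such as weekday, at the cost of fewer observations per group.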

# Test
* Predict the average consumption at a given time and multiply it by the number of households
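
A minimal sketch of this scaling step, assuming an average per-household profile indexed by hour of day is already available. The helper `forecast_portfolio`, the profile values and the household count are all hypothetical.

```python
import pandas as pd

def forecast_portfolio(hourly_profile, timestamps, n_households):
    """Hypothetical helper: look up the per-household average for each
    timestamp's hour and scale it by the number of households."""
    idx = pd.DatetimeIndex(timestamps)
    hours = pd.Series(idx.hour.to_numpy(), index=idx)
    return hours.map(hourly_profile) * n_households

# Made-up average consumption per household (kWh) for hours 0 and 1.
profile = pd.Series({0: 0.5, 1: 0.4})
future = pd.to_datetime(["2023-01-03 00:00", "2023-01-03 01:00"])

forecast = forecast_portfolio(profile, future, n_households=100)
print(forecast)
```

Hours missing from the profile produce NaN in the forecast, which makes gaps in the training data visible instead of silently filling them.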

# Forecast

# Challenges
* Combination of households with solar panels installed by Zonneplan and households with solar panels not installed by Zonneplan: some households have energy sources we are not aware of. They can be identified by negative consumption values.

# Possible next steps
* Look at possible clusters: based on infrastructure (with or without solar panels), consumption patterns (households that consume more than others at certain times and less at other times), or location
* Consider additional solar power data: size and tilt of the solar panels
* Classify customers with/without solar panels: using other data sources? Solar panel detection from satellite images?
* Look at other datasets, for example on Kaggle

To improve this prediction in the coming months, we will continue to collect and analyze historical data and update our models accordingly. We will also explore other forecasting methods, such as the Prophet model developed by Facebook, and machine learning techniques, such as random forests and neural networks.

One aspect that could have been better is the data cleaning and preprocessing. There were some missing values and inconsistencies in the data that needed to be addressed before we could create an accurate forecast. In the future, we will need to ensure that we have clean and consistent data before proceeding with the forecasting process.

Another area for improvement is the incorporation of external factors that could impact consumption, such as holidays, events, and changes in government policies. By including these factors in our forecasting models, we can create more accurate and robust predictions for our energy portfolio.