# Grading process


The submission notebook will be autovalidated with `papermill`. The exact command is the following:

```bash
papermill <notebook-name>.ipynb <notebook-name>-run.ipynb -p TEST True
```

Papermill will inject a new cell after each cell tagged `parameters` (see `View > Cell toolbar > Tags`). The notebook will be executed from top to bottom, in linear order. `solutions.py` contains the correct implementations used to validate your solutions.

Please **fill in the `STUDENT` variable with the name of the submitting student**, so that we can collect the results automatically. Please **do not change the `TEST` variable** or the `validation` cells. If you need to inject your own code for testing, wrap it in

```python
if not TEST:
    ...
```

Different problems are worth different numbers of points. All problems in the basic section are worth 1 point, while all problems in the intermediate section are worth 2 points.

Each problem contains specific validation details. You need to fill each cell tagged `solution` with your code. Note that a solution function must be self-contained, i.e. it must not use any state from the notebook itself.
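As a small illustration of the self-contained requirement, the sketch below contrasts a function that depends on notebook-level state with one that does not. The names `SCALE`, `scaled_bad` and `scaled_ok` are hypothetical and not part of the assignment.

```python
SCALE = 10  # notebook-level state; it may not exist where the function is validated


def scaled_bad(values):
    # Relies on the global SCALE, so it is not self-contained.
    return [v * SCALE for v in values]


def scaled_ok(values, scale=10):
    # Everything the function needs arrives through its arguments.
    return [v * scale for v in values]
```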

# Dataset

All problems in the assignment use the [electricity load dataset](https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014). Some functions/methods accept the data itself; in that case, it is a Pandas dataframe obtained by

```python
df = pd.read_csv("LD2011_2014.txt",
                 parse_dates=[0],
                 delimiter=";",
                 decimal=",")
df.rename({"Unnamed: 0": "timestamp"}, axis=1, inplace=True)
```

In contrast, whenever a function/method accepts a filename, it is the filename of the **unzipped** data file (i.e. `LD2011_2014.txt`). When testing, do not rely on any specific location of the dataset, as the validation environment will almost certainly differ from your local one. Hence, calls like

```python
df = pd.read_csv("<your-local-directory>/LD2011_2014.txt")
```

will fail.
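A minimal sketch of a location-independent loader, assuming a hypothetical helper name `load_electricity` (not part of the assignment): the filename is taken as an argument and used as given, so the same code should work wherever the data file happens to live.

```python
import pandas as pd


def load_electricity(filename):
    """Hypothetical helper: read the unzipped data file from whatever path is passed in."""
    df = pd.read_csv(filename,
                     parse_dates=[0],
                     delimiter=";",
                     decimal=",")
    # The first column has an empty header, which pandas names "Unnamed: 0".
    return df.rename({"Unnamed: 0": "timestamp"}, axis=1)
```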

In [4]:
import numpy as np
import pandas as pd
import torch

In [2]:
STUDENT = "Elad Eatah"

In [51]:
ASSIGNMENT = 1
TEST = False

In [52]:
if TEST:
    import solutions
    total_grade = 0
    MAX_POINTS = 12

# Pandas

### 1. Resample the dataset (1 point)

Resample the dataset to 1-hour resolution. Use `mean` as the aggregation function. Your function must output a dataframe with the same structure as the original one (i.e. with `timestamp` as a regular column, not as the index).
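To make the expected output shape concrete, here is a small sketch on synthetic data. The frame `toy` and its values are made up for illustration, and only mimic the layout of the real dataset (a `timestamp` column plus one household column).

```python
import pandas as pd

# Synthetic stand-in for the real data: two hours of 15-minute readings for one household.
toy = pd.DataFrame({
    "timestamp": pd.date_range("2011-01-01 00:00", periods=8, freq="15min"),
    "MT_001": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
})

# Mean over each 1-hour bin, then turn the datetime index back into a regular column.
hourly = toy.resample("1h", on="timestamp").mean().reset_index()
print(hourly)  # two rows, with MT_001 means 2.5 and 6.5
```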

In [24]:
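def el_resample(df):
    # Aggregate to 1-hour bins by mean, then restore `timestamp` as a regular column.
    return df.resample('1H', on="timestamp").mean().reset_index()

if not TEST:
    # Local check only: skipped during validation, where the dataset location differs.
    df1 = pd.read_csv("LD2011_2014.txt",
                      parse_dates=[0],
                      delimiter=";",
                      decimal=",")
    df1.rename({"Unnamed: 0": "timestamp"}, axis=1, inplace=True)
    # Save the hourly data (DataFrame has `to_csv`; there is no `save_csv` method).
    el_resample(df1).to_csv('ElectricityConsumptionHourly.csv', index=False)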

In [25]:
PROBLEM_ID = 1

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, el_resample)

### 2. Consumption peaks (1 point)

For each household, calculate which month in 2014 had the highest consumption. Your function must output a series indexed by household ID (e.g., `MT_XXX`) and containing the month as an integer (`1-12`).
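The expected output format can be sketched on synthetic data. The frame `toy` is made up for illustration, and the sketch assumes that "highest consumption" means the largest monthly total; a monthly mean would also fit the wording.

```python
import pandas as pd

# Synthetic stand-in: one reading per month of 2014 for two hypothetical households.
months = range(1, 13)
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([f"2014-{m:02d}-15" for m in months]),
    "MT_001": [10.0 if m == 3 else 1.0 for m in months],   # peaks in March
    "MT_002": [10.0 if m == 11 else 1.0 for m in months],  # peaks in November
})

# Total consumption per calendar month, then the month with the largest total per household.
indexed = toy.set_index("timestamp")  # returns a new frame; `toy` itself is unchanged
monthly = indexed.groupby(indexed.index.month).sum()
peaks = monthly.idxmax(axis=0)
print(peaks)  # MT_001 -> 3, MT_002 -> 11
```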

In [47]:
def cons_peak(df):
    # Picking only the entries from the year 2014 (filtering leaves the caller's df unchanged).
    df_temp = df[df["timestamp"].dt.year == 2014]

    # Resampling to 1-month resolution (with mean aggregation) to find the mean consumption per month
    df_temp = df_temp.resample('1M', on="timestamp").mean()

    # Replacing the index by the month only, as an integer.
    df_temp.index = df_temp.index.month

    # For each column (household), taking the index (month) of the maximum value
    series = df_temp.idxmax(axis=0)
    return series

In [None]:
PROBLEM_ID = 2

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, cons_peak)

# PyTorch

### 3. Find minimum (2 points)

Consider the following scalar function:

$$
f(x) = ax^2 + bx + c
$$

Given $a,b,c$, find the $x$ that minimizes $f(x)$, and the minimum value of $f(x)$. Note that:

- $a,b,c$ are fixed, and generated in such a way that a minimum always exists ($f(x)$ is convex),
- $x$ is a scalar value, i.e. a 0-dimensional tensor.

For reference, see the `generate_coeffs` function, which is used to generate the coefficients. Note that since the optimization process is not completely deterministic, the output is considered correct if it falls within `1e-3` of the actual values.

This problem must be solved as an optimization problem using gradient descent.

Use only PyTorch functionality for that: `SciPy` (or similar) optimization routines are not allowed, and neither is direct calculation from the coefficients.
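As a sanity check on a gradient-descent result (not as a substitute for it, since direct calculation is not accepted as a solution), the closed-form minimizer follows from setting the derivative to zero:

$$
f'(x) = 2ax + b = 0 \quad\Rightarrow\quad x^* = -\frac{b}{2a}, \qquad f(x^*) = c - \frac{b^2}{4a}
$$

The same algebra shows why the step size matters. Here $\eta$ denotes the learning rate and $x_k$ the $k$-th iterate (both symbols are introduced for this note only):

$$
x_{k+1} = x_k - \eta\,(2ax_k + b) \quad\Rightarrow\quad x_{k+1} - x^* = (1 - 2a\eta)\,(x_k - x^*)
$$

So plain gradient descent on this function converges only if $|1 - 2a\eta| < 1$, i.e. $0 < \eta < 1/a$, and it converges slowly when $a\eta$ is close to $0$ or to $1$.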

In [1]:
def generate_coeffs():
    a = torch.rand(size=()) * 10
    b = -10 + torch.rand(size=()) * 10
    c = -10 + torch.rand(size=()) * 10
    return a, b, c

def func(x, a, b, c):
    return x.pow(2) * a + x * b + c

In [2]:
def find_min(a, b, c):
    # Initial guess is random
    x_min = torch.randn(1, requires_grad=True, dtype=torch.float)

    # Setting the learning rate (gradient descent on this quadratic converges only if learning_rate < 1/a)
    learning_rate = 0.1

    # Number of gradient descent steps
    n_epochs = 1000

    for i in range(n_epochs):
        # The function value itself acts as the "loss" to minimize
        val_min = func(x_min, a, b, c)

        # Backpropagation: stores d(val_min)/d(x_min) in x_min.grad
        val_min.backward()

        # Update the guess using the gradient (derivative) and the learning rate
        with torch.no_grad():
            x_min -= learning_rate * x_min.grad

        # Zeroing the gradient to prevent accumulation across iterations
        x_min.grad.zero_()

    # Re-evaluate at the final x_min, since val_min above predates the last update
    with torch.no_grad():
        val_min = func(x_min, a, b, c)

    # Returning the results as single-precision floating-point numbers
    return x_min.detach().numpy()[0], val_min.numpy()[0]

In [19]:
PROBLEM_ID = 3

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, find_min)

### 4. PyTorch `Dataset` (3 points)

Implement a `torch.utils.data.Dataset` subclass for the electricity consumption data. Individual training instances must be week-long univariate series of hourly consumption (input, 168 values), followed by a 24-hour series of hourly consumption (output, 24 values) for a single household. Such a class can be used when training a consumption forecast model that uses 7 days of historical consumption to forecast the next 24 hours of consumption.

`__getitem__(self, idx)` must return a tuple of 1D tensors, `in_data` and `out_data`. `in_data` contains 168 hourly consumption values starting from some `start_ts`, while `out_data` must contain 24 hourly consumption values starting from `start_ts + 168 hours`, both for the same household. `start_ts` should be sampled randomly.

You also need to implement a `get_mapping(self, idx)` method, which returns the `(household, start_ts)` pair that corresponds to `idx`.

This class will be validated as follows:

- a dataset object is created with some random `samples`: `dataset = ElDataset(df, samples)`,
- the validator fetches a random `idx` (between `0` and `len(dataset)`) from the dataset:
```python
household, start_ts = dataset.get_mapping(idx)
hist_data, future_data = dataset[idx]
```
- then `hist_data` and `future_data` are compared with the data obtained directly from `df` using `household, start_ts`.
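The comparison in the last step might look roughly like the sketch below. This is a guess at the idea, not the actual validator: `expected_windows` is a hypothetical helper, and it assumes the dataframe is already at hourly resolution and sorted by `timestamp`.

```python
import pandas as pd


def expected_windows(hourly_df, household, start_ts, input_length=168, output_length=24):
    """Hypothetical reference slice: the values a sample could be compared against."""
    # All readings of the household from start_ts onwards.
    series = hourly_df.loc[hourly_df["timestamp"] >= start_ts, household]
    hist = series.iloc[:input_length].to_numpy()
    future = series.iloc[input_length:input_length + output_length].to_numpy()
    return hist, future


# Synthetic hourly data: the value at each row equals its row number.
toy_hourly = pd.DataFrame({
    "timestamp": pd.date_range("2012-01-01", periods=200, freq="h"),
    "MT_001": [float(i) for i in range(200)],
})
hist, future = expected_windows(toy_hourly, "MT_001", toy_hourly["timestamp"][5])
print(len(hist), len(future))  # 168 24
```

A dataset sample for the same `(household, start_ts)` pair would then be expected to match `hist` and `future` element by element.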

In [7]:
# Configurable params
input_length = 168
output_length = 24

class ElDataset(torch.utils.data.Dataset):
    """Electricity dataset."""

    def __init__(self, df, samples):
        """
        Args:
            df: original electricity data (see HW intro for details).
            samples (int): number of samples to take per household.
        """
        self.raw_data = df.resample('1H', on="timestamp").mean().reset_index()  # Aggregation as in Q1
        self.samples = samples

    def __len__(self):
        return self.samples * (self.raw_data.shape[1] - 1)

    def __getitem__(self, idx):
        household, start_ts = self.get_mapping(idx)  # Find the mapped household and start_ts
        all_data = self.raw_data[self.raw_data["timestamp"] >= start_ts][household]  # Pick values from start_ts and forward
        
        # Pick the first 7 days (168 hours) for hist_data and the following day (24 hours) for future_data
        # Data is returned as PyTorch 1D tensors
        hist_data = torch.tensor(all_data[:input_length].values)
        future_data = torch.tensor(all_data[input_length: input_length + output_length].values)
        return hist_data, future_data

    def get_mapping(self, idx):
        # A local generator seeded with idx keeps this method reproducible and in sync with
        # __getitem__, without resetting the global RNG state.
        gen = torch.Generator().manual_seed(idx)
        household_ind = idx % (self.raw_data.shape[1] - 1)  # Mapping idx to a specific household
        household = self.raw_data.columns[household_ind + 1]  # +1 skips the "timestamp" column
        time_samples_count = len(self.raw_data.index)
        start_ts_ind = torch.randint(high=time_samples_count - input_length - output_length,
                                     size=(1,), generator=gen).item()
        start_ts = self.raw_data["timestamp"][start_ts_ind]
        return household, start_ts

In [19]:
PROBLEM_ID = 4

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, ElDataset)

# Your grade

In [None]:
if TEST:
    print(f"{STUDENT}: {total_grade}")