# Grading process


The submission notebook will be autovalidated with `papermill`. The exact command is the following:

```bash
papermill <notebook-name>.ipynb <notebook-name>-run.ipynb .ipynb -p TEST True
```

Papermill will inject new cell after each cell tagged as `parameters` (see `View > Cell toolbar > Tags`). Notebook will be executed from top to bottom in a linear order. `solutions.py` contains correct implementations used to validate your solutions.

Please, **fill `STUDENT` variable with the name of submitting student**, so that we can collect the results automatically. Please, **do not change `TEST` variable** and `validation` cells. If you need to inject your own code for testing, wrap it into

```python
if not TEST:
    ...
```

Different problems give different number of points. All problems in the basic section give 1 point, while all problems in intermediate section give 2 points.

Each problem contains specific validation details. You need to fill each cell tagged `solution` with your code. Note, that solution function must self-contained, i.e. it must not use any state from the notebook itself.

# Dataset

All problems in the assignment use [electricity load dataset](https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014). Some functions/methods accept data itself, and in that case it's a Pandas dataframe as obtained by

```python
df = pd.read_csv("LD2011_2014.txt",
                 parse_dates=[0],
                 delimiter=";",
                 decimal=",")
df.rename({"Unnamed: 0": "timestamp"}, axis=1, inplace=True)
```

In contrast, whenever a function/method accepts a filename, it's the filename of **unzipped** data file (i.e. `LD2011_2014.txt`). When testing, do not rely on any specific location of the dataset, as validation environment will most certainly different from your local one. Hence, calls like

```python
df = pd.read_csv("<your-local-directory>/LD2011_2014.txt")
```

will fail.

In [1]:
import numpy as np
import pandas as pd
import torch

In [None]:
STUDENT = "Erez Shwarts"

In [None]:
ASSIGNMENT = 1
TEST = False

In [None]:
if TEST:
    import solutions
    total_grade = 0
    MAX_POINTS = 12

# Pandas

### 1. Resample the dataset (1 point)

Resample the dataset to 1-hour resolution. Use `mean` as an aggregation function. Your function must output a dataframe, with the same structure as the original one (i.e. not indexed by datetime).

In [2]:
df = pd.read_csv("LD2011_2014.txt",
                 parse_dates=[0],
                 delimiter=";",
                 decimal=",")
df.rename({"Unnamed: 0": "timestamp"}, axis=1, inplace=True)

In [None]:
df

In [3]:
def el_resample(df):
    df_copy = df.copy()
    df_copy.set_index("timestamp", inplace=True)
    return df_copy.resample("H").mean()

In [None]:
PROBLEM_ID = 1

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, el_resample)

### 2. Consumption peaks (1 point)

For each household, calculate, which month in 2014 had the highest consumption. Your function must output series, indexed by household ID (e.g., `MT_XXX`), and containing month as an integer (`1-12`).

In [None]:

def cons_peak(df):
    CHOSEN_YEAR = 2014
    copy_df = df.copy()
    copy_df = copy_df[copy_df["timestamp"].dt.year == CHOSEN_YEAR]
    copy_df["month"] = copy_df["timestamp"].dt.month
    return copy_df.groupby("month").agg("sum").idxmax(axis=0)
    

In [None]:
PROBLEM_ID = 2

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, cons_peak)

# PyTorch

### 3. Find minimum (2 points)

Consider the following scalar function:

$$
f(x) = ax^2 + bx + c
$$

Given $a,b,c$, find $x$, which minimizes $f(x)$, and minimum value of $f(x)$. Note this:

- $a,b,c$ are fixed, and generated in such a way, that minimum always exists ($f(x)$ is convex),
- $x$ is a scalar value, i.e. 0-dimensional tensor.

For reference, see `generate_coef` function, which is used to generate coefficients. Note, that since optimization process is not completely deterministic, the output is considered correct, if it falls within `1e-3` of actual values.

This problem must be solved as an optimization one using gradient descent.

For that, use only PyTorch functionality, `SciPy` (or alike) optimization routines are not allowed, neither is direct calculation using coefficients.

In [8]:
import torch

In [9]:
def generate_coeffs():
    a = torch.rand(size=()) * 10
    b = -10 + torch.rand(size=()) * 10
    c = -10 + torch.rand(size=()) * 10
    print(a, b, c)
    return a, b, c

def func(x, a, b, c):
    return x.pow(2) * a + x * b + c

In [10]:

a,b,c = generate_coeffs()

x = torch.tensor(1)




tensor(6.3511) tensor(-8.1222) tensor(-4.9966)


In [11]:
find_min(a,b,c)

24.730308532714844 tensor(28.6560)
-5.2329254150390625 tensor(-7.7437)
-7.420960903167725 tensor(2.0926)
-7.580739974975586 tensor(-0.5655)
-7.5924072265625 tensor(0.1528)
-7.593259811401367 tensor(-0.0413)
-7.593321800231934 tensor(0.0112)
-7.593326568603516 tensor(-0.0030)


  x_iter = torch.tensor(10 * torch.rand(size=()), requires_grad=True)
  x_iter = torch.tensor(x_iter - MEU * grad, requires_grad=True)


(-7.593326568603516, tensor(0.6395, requires_grad=True))

In [5]:
def find_min(a, b, c):
    
    EPS = 1e-2
    MEU = 0.1

    # using convex optimization methods
    # return x_min, val_min
    
    # making sure x_iter is leaf so autograd will function correctly
    x_iter = torch.tensor(10 * torch.rand(size=()), requires_grad=True) 
    grad = torch.tensor(1.)
    
    while torch.norm(grad) > EPS:
        target_tensor = func(x_iter, a, b, c)
        target_tensor.backward()
        grad = x_iter.grad
        print(target_tensor.item(), grad)
        x_iter = torch.tensor(x_iter - MEU * grad, requires_grad=True)
         
    return target_tensor.item(), x_iter

In [None]:
PROBLEM_ID = 3

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, find_min)

### 4. PyTorch `Dataset` (3 points)

Implement a `torch.utils.data.Dataset` sub-class for the electricity consumption data. Individual training instances must be week-long univarite series of hourly consumption (input, 168 values), followed by 24-hours long series of hourly consumption (output, 24 values) for a single household. Such a class can be used when training a consumption forecast model, which uses 7 days of historical consumption to forecast next 24 hours of consumption.

`__getitem__(self, idx)` must return a tuple of 1D tensors, `in_data` and `out_data`. `in_data` contains 168 hours of consumption (hourly), starting from some `start_ts`, while `out_data` must contain 24 hourly consumption values starting from `start_ts + 168 hours` for some household. `start_ts` should be sampled randomly.

Also, you need to implement a `get_mapping(self, idx)` method, which allows to calculate `(household, starting time) -> idx` correspondence.

This class will be validated as the following:

- dataset object is created with some random `samples`: `dataset = ElDataset(df, samples)` ,
- validator fetches random `idx` (between `0` and `len(dataset)`) from the dataset:
```python
household, start_ts = dataset.get_mapping(idx)
hist_data, future_data = dataset[idx]
```
- then, `hist_data` and `future_data` are compared with the data, obtained directly from `df` using `household, start_ts`.

In [12]:
import numpy as np

In [13]:
temp_df = el_resample(df)
temp_df.reset_index(inplace=True)

SAMPLES = 100
X_SEQUENCE_LEN = 168
Y_SEQUENCE_LEN = 24
PRED_OFFSET = X_SEQUENCE_LEN + Y_SEQUENCE_LEN
NUM_HOUSEHOLDS = len(temp_df.columns.tolist()[1:]) 
DATA_LEN = len(temp_df)

timestamps_series = temp_df.iloc[:,0]
chosen_indexes = np.random.choice(DATA_LEN - (X_SEQUENCE_LEN + Y_SEQUENCE_LEN), (NUM_HOUSEHOLDS, SAMPLES))

households = temp_df.columns.tolist()[1:]

data_list = []

for household_index,household in enumerate(households):
    household_series = temp_df[household]
    for ts_index in chosen_indexes[household_index,:]:
        cur_x_sample = torch.tensor(household_series[ts_index:ts_index+X_SEQUENCE_LEN].values)
        cur_y_sample = torch.tensor(household_series[ts_index+X_SEQUENCE_LEN: ts_index + PRED_OFFSET].values)
        list_item = (household, timestamps_series[ts_index], cur_x_sample, cur_y_sample)
        iter_list.append(list_item)


NameError: name 'iter_list' is not defined

In [14]:
from torch.utils.data import Dataset

In [38]:
class ElDataset(Dataset):
    """Electricity dataset."""

    def __init__(self, df, samples):
        """
        Args:
            df: original electricity data (see HW intro for details).
            samples (int): number of sample to take per household.
        """
        X_SEQUENCE_LEN = 168
        Y_SEQUENCE_LEN = 24
        PRED_OFFSET = X_SEQUENCE_LEN + Y_SEQUENCE_LEN
        
        self.raw_data = df.reset_index()
        
        NUM_HOUSEHOLDS = len(self.raw_data.columns.tolist()[1:]) 
        DATA_LEN = len(self.raw_data)
        
        self.samples = samples
        self.data_list = []
        
        timestamps_series = self.raw_data.iloc[:,0]
        chosen_indexes = np.random.choice(DATA_LEN - (X_SEQUENCE_LEN + Y_SEQUENCE_LEN), (NUM_HOUSEHOLDS, samples))

        households = self.raw_data.columns.tolist()[1:]

        for household_index,household in enumerate(households):
            household_series = self.raw_data[household]
            for ts_index in chosen_indexes[household_index,:]:
                cur_x_sample = torch.tensor(household_series[ts_index:ts_index+X_SEQUENCE_LEN].values)
                cur_y_sample = torch.tensor(household_series[ts_index+X_SEQUENCE_LEN: ts_index + PRED_OFFSET].values)
                list_item = (household, timestamps_series[ts_index], cur_x_sample, cur_y_sample)
                self.data_list.append(list_item)


    def __len__(self):
        return self.samples * (self.raw_data.shape[1] - 1)

    def __getitem__(self, idx):
        # return hist_data, future_data
        data_entry = self.data_list[idx]
        return data_entry[2], data_entry[3]

    def get_mapping(self, idx):
        # your code goes here
        data_entry = self.data_list[idx]
        return data_entry[0], data_entry[1]

In [24]:
ds.raw_data

NameError: name 'ds' is not defined

In [26]:
temp_df = el_resample(df)
ds = ElDataset(temp_df, 1)


start
MT_001
MT_002
MT_003
MT_004
MT_005
MT_006
MT_007
MT_008
MT_009
MT_010
MT_011
MT_012
MT_013
MT_014
MT_015
MT_016
MT_017
MT_018
MT_019
MT_020
MT_021
MT_022
MT_023
MT_024
MT_025
MT_026
MT_027
MT_028
MT_029
MT_030
MT_031
MT_032
MT_033
MT_034
MT_035
MT_036
MT_037
MT_038
MT_039
MT_040
MT_041
MT_042
MT_043
MT_044
MT_045
MT_046
MT_047
MT_048
MT_049
MT_050
MT_051
MT_052
MT_053
MT_054
MT_055
MT_056
MT_057
MT_058
MT_059
MT_060
MT_061
MT_062
MT_063
MT_064
MT_065
MT_066
MT_067
MT_068
MT_069
MT_070
MT_071
MT_072
MT_073
MT_074
MT_075
MT_076
MT_077
MT_078
MT_079
MT_080
MT_081
MT_082
MT_083
MT_084
MT_085
MT_086
MT_087
MT_088
MT_089
MT_090
MT_091
MT_092
MT_093
MT_094
MT_095
MT_096
MT_097
MT_098
MT_099
MT_100
MT_101
MT_102
MT_103
MT_104
MT_105
MT_106
MT_107
MT_108
MT_109
MT_110
MT_111
MT_112
MT_113
MT_114
MT_115
MT_116
MT_117
MT_118
MT_119
MT_120
MT_121
MT_122
MT_123
MT_124
MT_125
MT_126
MT_127
MT_128
MT_129
MT_130
MT_131
MT_132
MT_133
MT_134
MT_135
MT_136
MT_137
MT_138
MT_139
MT_140
MT_141
MT_142


In [None]:
PROBLEM_ID = 4

if TEST:
    total_grade += solutions.check(STUDENT, PROBLEM_ID, ElDataset)

# Your grade

In [None]:
if TEST:
    print(f"{STUDENT}: {total_grade}")

In [None]:
print("Hello")