# Implementing an RNN for sequential data processing

The previous labs focused on image recognition. CNNs are well suited to that task because they capture spatial information. But what happens when we have sequential data? We need a model that can capture dependencies through time or space, and that is what RNNs are made for.

In [None]:
## Setup the imports 
import numpy as np 
import matplotlib.pyplot as plt
import csv

## Data preparation

This time the dataset will be a weather dataset. You have two CSV files (train and test data) with climate information over 4 years. Each data point has five fields: the date, the temperature, the humidity, the wind speed, and the pressure.

We will predict the meanpressure over time, but you can also select other targets.

In [None]:
## First let's load the data
with open("DailyDelhiClimateTrain.csv") as f :
    reader = csv.reader(f)
    data = list(reader)

field_names = data[0]

print(field_names)
print(data[1])

Because the data comes from a CSV file, every value is loaded as a string. We need to convert these strings into numerical inputs that the model can work with. The following code does this.

In [None]:
## dates are stored as date strings; we simply replace them with an increasing integer index used as a timestamp :
timestamp = np.arange(len(data[1:]))
print(timestamp)

## the other values are float values that have been read as strings :
meantemp = np.array([float(d[1]) for d in data[1:]])
print(meantemp)

humidity = np.array([float(d[2]) for d in data[1:]])
print(humidity)

windspeed = np.array([float(d[3]) for d in data[1:]])
print(windspeed)

meanpressure = np.array([float(d[4]) for d in data[1:]])
print(meanpressure)

We will now create a dataset to load the data. We need to distinguish whether it's a train set or a test set. The following code creates a basic Climate Dataset.

In [None]:
class ClimateDataset :
    def __init__(self, is_train, batch_size) -> None:
        self.batch_size = batch_size
        self.is_train = is_train
        if self.is_train :
            filename = "DailyDelhiClimateTrain.csv"
        else : 
            filename = "DailyDelhiClimateTest.csv"
        
        with open(filename) as f :
            reader = csv.reader(f)
            data = list(reader)

        self.field_names = data[0]

        self.timestamp = np.arange(len(data[1:]))
        meantemp = np.array([float(d[1]) for d in data[1:]])
        humidity = np.array([float(d[2]) for d in data[1:]])
        windspeed = np.array([float(d[3]) for d in data[1:]])
        meanpressure = np.array([float(d[4]) for d in data[1:]])

        ## we will predict the mean pressure (see self.target below) from the three other features
        self.features = np.stack([
            meantemp,
            windspeed,
            humidity
        ]).T

        self.target = meanpressure

    @property
    def data(self):
        return self.features
    
    def __len__(self): 
        return self.target.shape[0]

    def __getitem__(self, index):
        end = min(index + self.batch_size, len(self))
        return self.features[index:end], self.target[index:end]

Let's plot the data we are working with.

In [None]:
ds = ClimateDataset(True, 4)
fig = plt.figure()
fig.set_size_inches(14, 10)
ax1 = fig.add_subplot(2, 2, 1)

ax1.plot(range(len(ds)), ds.features[:,0])
ax1.set_title("meantemp")

ax2 = fig.add_subplot(2, 2, 2)
ax2.plot(range(len(ds)), ds.features[:,1])
ax2.set_title("windspeed")

ax3 = fig.add_subplot(2, 2, 3)
ax3.plot(range(len(ds)), ds.features[:,2])
ax3.set_title("humidity")

ax4 = fig.add_subplot(2, 2, 4)
ax4.plot(range(len(ds)), ds.target)
ax4.set_title("meanpressure")


As you can see, the data is quite noisy, and the mean pressure in particular contains outliers. This will make it hard for the network to learn. We need to preprocess the data to obtain smoother inputs.

## Data preprocessing 

To preprocess the data, we will do three things: first, we will deal with outliers; second, we will run a sliding window to average over time and obtain a smoother input; finally, we will apply min-max scaling.

### Outlier removal

To deal with outliers in the pressure data, we will compute the mean value of the pressure and then look at every datapoint. If the difference between the pressure and the mean value is greater than a set threshold, then we assign the mean pressure to this datapoint.
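
The rule described above can be sketched on toy data as follows. This is only an illustration under assumptions: the function name `replace_outliers_with_mean` and the toy array are hypothetical, and the mean is computed on the raw values, so it is itself slightly pulled by the outlier.

```python
import numpy as np

def replace_outliers_with_mean(values, threshold=50.0):
    """Return a copy of `values` where every point farther than
    `threshold` from the mean is replaced by that mean."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()                        # mean of the raw data (outliers included)
    is_outlier = np.abs(values - mean) > threshold
    cleaned = values.copy()                     # leave the input untouched
    cleaned[is_outlier] = mean
    return cleaned

# Toy pressure series: 99 regular readings and one irregular one.
toy_pressure = np.full(100, 1010.0)
toy_pressure[40] = 1500.0
cleaned = replace_outliers_with_mean(toy_pressure, threshold=50.0)
print(cleaned[38:43])
```

With very few points or a very large outlier, the mean can shift so much that regular points also exceed the threshold, so it is worth checking the plot after applying this kind of rule.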

In [None]:
def solve_outlier(pressure, threshold=50):
    ### TODO : fix the outlier present in the pressure by assigning the mean pressure to data points that have irregular values.
    return pressure

## TODO : modify the ClimateDataset class to solve the outlier pressure problem during loading of the data.

## TODO : plot the graph of the new pressure, it should look like the following.

Your new pressure should look like this : 

<img src="pressure_after_outlier_removal.png">

### Rolling average 

Now we need to smooth our data: it is very noisy and can easily confuse the network. To smooth it, we will run a rolling average (sliding window) over the data and compute each new data point as the average over the window. Rolling averages are widely used with sequential or temporal data, as they highlight patterns and meaningful signals by smoothing out the noise.
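
As a small illustration of a rolling average, here is a sketch on a toy array. It assumes a trailing window that ends at the current index and includes it, and that shrinks at the start of the series; the name `trailing_mean` is hypothetical, and the convention expected in the exercise may differ slightly.

```python
import numpy as np

def trailing_mean(values, window_size=3):
    """Mean over the last `window_size` points up to and including index i.
    Near the start, the window simply contains fewer points."""
    values = np.asarray(values, dtype=float)
    out = np.empty_like(values)
    for i in range(len(values)):
        start = max(0, i - window_size + 1)     # clamp the window at the beginning
        out[i] = values[start:i + 1].mean()
    return out

toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
smoothed = trailing_mean(toy, window_size=3)
print(smoothed)  # [1.  1.5 2.  3.  4. ]
```

A larger window gives a smoother curve but also delays and flattens genuine changes in the signal.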

In [None]:
def rolling_average(feat, window_size=30): 
    ## TODO : implement the rolling average.
    ## new_feat[i] should be the mean of the window of size window_size that comes before i
    ## if i < window_size then just take the mean from the beginning.
    new_feat = np.zeros_like(feat)
    
    return new_feat


## TODO : modify your dataset so that the features and target are smoothed with a window of size 30

## TODO : plot your results, it should look like the following.

Your data after rolling average should look like this : 

<img src="roll_averaged_data.png">

### Min-max processing

The last preprocessing step is min-max scaling.
This time we are not dealing with image pixels but with recorded real-world measurements. As you may notice, these values vary greatly in scale: some are between 0 and 10 while others are in the thousands. We have seen in previous labs that neural networks are sensitive to scale, so we need to normalize the values between 0 and 1. As before, we will use min-max scaling:

$$ x' = \frac{x - \min(x)}{\max(x)-\min(x)} $$

However this time we have a `regression` problem and not a classification problem, so we need to also scale the `target`. This also means that you need to keep track of the $max$ and $min$ so you can retrieve the true values later.
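
The scaling formula and its inverse can be sketched as below. The helper names `min_max_scale` and `min_max_unscale` are hypothetical, and the sketch assumes the input is not constant (otherwise the denominator is zero).

```python
import numpy as np

def min_max_scale(values):
    """Scale `values` to [0, 1] and return the min and max used."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo), lo, hi    # assumes hi > lo

def min_max_unscale(scaled, lo, hi):
    """Invert min_max_scale to recover values in the original units."""
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo

toy = np.array([1000.0, 1005.0, 1020.0])
scaled, lo, hi = min_max_scale(toy)
restored = min_max_unscale(scaled, lo, hi)
print(scaled)    # [0.   0.25 1.  ]
print(restored)  # [1000. 1005. 1020.]
```

Keeping `lo` and `hi` next to the scaled target is what makes it possible to report predictions in the original units later.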

In [None]:
def min_max_processing(feat) :
    ## TODO : implement min-max preprocessing to scale the values between 0 and 1
    ## ALSO return the min/max values so you can scale it back later

    return feat

## TODO : modify your dataset to apply min-max processing.

    
## TODO : plot the new values, should look like the following

Your fully processed data should look like this. Notice that it is the same graph as before, but with values between 0 and 1.

<img src="fully_processed.png">

### Train-test-val split

As in previous labs, we need a train-test-validation split.
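
For sequential data, one common way to build a validation set is to hold out a contiguous block at the end of the training series instead of sampling points at random. The sketch below illustrates this on toy arrays; the function name `chronological_split` and the 20% fraction are assumptions, not requirements of the lab.

```python
import numpy as np

def chronological_split(features, target, val_fraction=0.2):
    """Keep the first part of the series for training and the last
    `val_fraction` of it for validation, preserving time order."""
    n_val = int(len(target) * val_fraction)
    split = len(target) - n_val
    train = (features[:split], target[:split])
    val = (features[split:], target[split:])
    return train, val

toy_features = np.arange(30, dtype=float).reshape(10, 3)  # 10 timesteps, 3 features
toy_target = np.arange(10, dtype=float)
(train_x, train_y), (val_x, val_y) = chronological_split(toy_features, toy_target)
print(train_y, val_y)
```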

In [None]:
## TODO : create a train-test-val split for this task. 
## The data is already split into train-test. We need a validation set.
## BE CAREFUL : you need the validation set to be SEQUENCES, DON'T SHUFFLE

## Model implementation.

We will implement the LSTM (long short term memory) module. LSTM is a modification of RNN that was introduced to deal with the problem of vanishing gradients in RNNs.

The following diagram of the LSTM unit is taken from [Wikipedia](https://en.wikipedia.org/wiki/Long_short-term_memory).

<img src="LSTM_Cell.svg">


An LSTM unit is composed of a cell and three gates: an input gate, an output gate, and a forget gate. The computations made by the unit are as follows:

- $f_t = \sigma_g(W_fx_t + U_fh_{t-1} + b_f)$   forget gate computation.
- $i_t = \sigma_g(W_ix_t + U_ih_{t-1} + b_i)$   input gate computation.
- $o_t = \sigma_g(W_ox_t + U_oh_{t-1} + b_o)$   output gate computation.
- $c'_t = \sigma_c(W_{c'}x_t + U_{c'}h_{t-1} + b_{c'})$  cell intermediate value.
- $c_t = f_t\odot c_{t-1} + i_t \odot c'_t$    update of cell value.
- $h_t = o_t \odot \sigma_h(c_t)$   update of hidden state value.

Where $c_t$ represents the value of the cell at time $t$ and $h_t$ is the hidden state at time $t$, with $c_{-1} = 0$ and $h_{-1} = 0$.

If we denote by $d$ the input dimension and by $h$ the hidden dimension, each $W \in \mathbb{R}^{h\times d}$ and each $U \in \mathbb{R}^{h\times h}$ is a weight matrix, and each bias $b \in \mathbb{R}^{h}$ is a vector.

Finally $\sigma_g$ is the sigmoid activation function, $\sigma_c$ is the hyperbolic tangent activation, and $\sigma_h$ is either hyperbolic tangent or identity (you can choose).

$\odot$ is just the element-wise product.
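
To make the six equations above concrete, here is a minimal sketch of a single forward step in plain NumPy. It is not the class-based API used later in this lab: the function `lstm_step`, the `params` dictionary layout, and the random initialization are assumptions made for illustration, and $\sigma_h$ is taken to be the hyperbolic tangent.

```python
import numpy as np

def logistic(z):
    """Sigmoid activation, written out so the block is self-contained."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; keys 'f', 'i', 'o', 'c' index the three gates
    and the cell candidate."""
    W, U, b = params["W"], params["U"], params["b"]
    f_t = logistic(W["f"] @ x_t + U["f"] @ h_prev + b["f"])    # forget gate
    i_t = logistic(W["i"] @ x_t + U["i"] @ h_prev + b["i"])    # input gate
    o_t = logistic(W["o"] @ x_t + U["o"] @ h_prev + b["o"])    # output gate
    c_cand = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # cell intermediate value
    c_t = f_t * c_prev + i_t * c_cand                          # element-wise products
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
d, h = 3, 4                                   # input and hidden dimensions
params = {
    "W": {k: rng.normal(scale=0.1, size=(h, d)) for k in "fioc"},
    "U": {k: rng.normal(scale=0.1, size=(h, h)) for k in "fioc"},
    "b": {k: np.zeros(h) for k in "fioc"},
}
h_t, c_t = np.zeros(h), np.zeros(h)           # zero initial hidden state and cell
for x_t in rng.normal(size=(5, d)):           # a toy sequence of 5 timesteps
    h_t, c_t = lstm_step(x_t, h_t, c_t, params)
print(h_t.shape, c_t.shape)
```

Because the same `params` are reused at every timestep, each weight contributes to the output at all timesteps, which is what makes the backward pass accumulate gradients over time.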

The hyperbolic tangent is given by : 

$$ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $$

TODO : compute the derivative of the tanh function (pen and paper).
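
Once you have a candidate formula on paper, a finite-difference check is a simple way to confirm it numerically. The sketch below demonstrates the technique on the sigmoid, whose derivative $\sigma(x)(1-\sigma(x))$ is known; the helper `numerical_derivative` is a hypothetical name, and the same check can be pointed at `np.tanh` and your own formula.

```python
import numpy as np

def numerical_derivative(f, x, eps=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 7)
analytic = logistic(x) * (1 - logistic(x))     # known derivative of the sigmoid
numeric = numerical_derivative(logistic, x)
max_error = np.max(np.abs(analytic - numeric))
print(max_error)                               # should be very close to 0
```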

In [None]:
## TODO : do the todo above.

### Activation functions

We will warm up by implementing the activation layers.

In [None]:
## TODO : implement sigmoid, tanh, relu and their backward function. 
## normally you already have sigmoid and relu.
## you can use np.tanh
def sigmoid(x):
    return x

def tanh(x):
    return x

def relu(x):
    return x

class SigmoidLayer : 
    """Use previous lab implementation"""

class ReLULayer : 
    """Use previous lab implementation"""

class TanhLayer :
    """Implement the tanh layer similarly to the other layers"""

    

## Time dependence

The model we are about to implement introduces a dependency through time. Since we are dealing with sequential data, the previous inputs of the sequence influence how the current input is processed. This means that the same input can produce a different output, depending on the inputs that came before it. This is important to keep in mind because backpropagation will then consider not only the current gradient, but also gradients from previous computations. This procedure is called backpropagation through time (BPTT).

### BPTT 

As a consequence of the dependence through time and the chain rule, gradients need to be accumulated over timesteps. Suppose we have a sequence of $T$ timesteps and want to backpropagate the gradient through them. The total loss of the sequence is : 
$$ \mathcal{L} = \sum_{t=1}^T \mathcal{L}_t $$

So the total gradient of the loss with respect to a parameter $W$ will be : 

$$ \frac{\partial \mathcal{L}}{\partial W} = \sum_{t=1}^T \frac{\partial \mathcal{L}_t} {\partial W} $$

For recurrent models like LSTMs, parameters $W$ influence the loss $\mathcal{L}_t$ not just directly at timestep $t$, but also indirectly through their effect on earlier timesteps. This happens because the hidden state $h_t$, which depends on $W$, is passed forward through time and influences future states.
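
The accumulation of gradients through time can be illustrated on a model much simpler than an LSTM. The sketch below assumes a scalar linear recurrence $h_t = w\,h_{t-1} + x_t$ with $h_0 = 0$ and a per-step loss $\frac{1}{2}(h_t - y_t)^2$; all names and toy values are hypothetical, and the result is compared against a finite-difference estimate.

```python
import numpy as np

def forward(w, xs):
    """Run the recurrence h_t = w * h_{t-1} + x_t and return every h_t."""
    h, hs = 0.0, []
    for x in xs:
        h = w * h + x
        hs.append(h)
    return np.array(hs)

def total_loss(w, xs, ys):
    return 0.5 * np.sum((forward(w, xs) - ys) ** 2)

def bptt_grad(w, xs, ys):
    """Gradient of the total loss with respect to w, accumulated backward in time."""
    hs = forward(w, xs)
    dw, dh_next = 0.0, 0.0
    for t in reversed(range(len(xs))):
        # direct term from the loss at t, plus the term flowing back from t + 1
        dh = (hs[t] - ys[t]) + w * dh_next
        h_prev = hs[t - 1] if t > 0 else 0.0
        dw += dh * h_prev                      # contributions are SUMMED over time
        dh_next = dh
    return dw

xs = np.array([1.0, 0.5, -0.3, 0.8])
ys = np.array([0.5, 0.4, 0.1, 0.6])
w, eps = 0.7, 1e-5
analytic = bptt_grad(w, xs, ys)
numeric = (total_loss(w + eps, xs, ys) - total_loss(w - eps, xs, ys)) / (2 * eps)
print(analytic, numeric)
```

The term `w * dh_next` is the indirect dependency: the hidden state at time $t$ also affects every later loss, so its gradient carries a contribution from the future timesteps.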

### Implementation

For more details on how to perform backpropagation through time, see the supplementary information document provided with this lab.

In [None]:
class Gate:
    """Implement the gates.
    Hint : have you noticed the gates look like linear layers and an activation ?"""

    def __init__(self, input_channel, hidden_dim, activation_layer) -> None:
        """Should have a W matrix, a U matrix and a bias vector.
        The activation layer is either a sigmoid or a tanh layer."""

    
    def forward(self, x, h):
        """Performs gate computation, linear part and activation.
           x is input at time t, h is hidden state at time t-1 """
        return h

    def backward(self, dLdy, x_prev, h_prev): 
        """Inputs : 
            - dLdy : gradient with respect to the output at time t.
            - x_prev : input at time t
            - h_prev : hidden state at time t-1.
        
        Computes : 
            dLdz : backward through the activation
            dLdW : gradient with respect to W at time t
            dLdU : gradient with respect to U at time t
            dLdb : gradient with respect to b at time t 
        
        The gradients should be SUMMED while you are considering the same sequence.

        Outputs : 
            - dLdx_prev : gradient with respect to the input at time t
            - dLdh_prev : gradient with respect to the hidden state at time t-1"""

        dLdx_prev = np.zeros_like(x_prev)
        dLdh_prev = np.zeros_like(h_prev)


        return dLdx_prev, dLdh_prev
    
    def reset_gradients(self): 
        """You should reset the gradients to 0 after step"""
        
    def step(self, lr) : 
        """Update parameters and then reset gradients."""

        self.reset_gradients()
    
    def __call__(self, x, h):
        return self.forward(x,h)
        


class LSTMLayer :
    """Implements the LSTM layer. You can follow the previous lab's API.
    Use three gates and maintain a cell and a hidden state. Hint : the cell is also using a gate.
    You should also maintain a cache while computing the same sequence.
    """
    def __init__(self, input_dim, hidden_dim) -> None:
        """Initialize gates and cell, initialize the cache of the sequence.
        Note : this unit has no parameters of its own, only the gates and cell have them."""

    def forward(self, x):
        """For a sequence you should : 
            - Reset the cache
            - Find the length of the sequence.
            - Loop through time and : 
                - Compute the output of each gate at time t.
                - Compute the final cell value at time t.
                - Compute the hidden state which is the output for the time t.
                - Save ALL intermediate values to the cache
            - Return the final output which should be a sequence of the same length as the input sequence."""
        
        return x
    
    def backward(self, dLdh):
        """
        Backward pass for the LSTM unit. 
        Input : 
            - dLdh : the gradient of the loss with respect to the output per time step.
        
        Output : 
            - dLdx : the gradient of the loss with respect to the input sequence.
        
        To correctly compute each gradient, you need to use the chain rule and the formulas of the LSTM unit.
        You will need to compute each intermediate value's gradient. Especially, before going backward in each gate,
        you will need to compute the gradient w.r.t the output of each gate.
        """
        T = len(dLdh) ## length of the sequence

        dx = np.zeros((T, self.input_dim), dtype=np.float32) 

        for i in reversed(range(T)): ### Going back through time
            pass

            # Gradients for cell state and output

            
            # Gradients of the gate outputs

            
            # Backpropagate through gates

            
            # Combine gradients from all gates

            ## Be mindful that after the first iteration, the gradient with respect to the
            ## hidden state and the cell now have a new dependency.
            ## See supplementary information to find that dependency

        
        return dx
    
    def reset_cache(self) : 
        """Reset the cache"""
        pass

    def step(self, lr):
        """Step through each gate, this unit has no parameters."""
        pass

    def __call__(self, x):
        return self.forward(x)

In [None]:
class LinearLayer :
    """Use your previous lab implementation.
    TODO : modify it to be able to take batched input.
    Instead of being a vector of size C, it should be an input of size (T,C),
    Where T is the sequence length."""

We will now implement a full model. You need to use linear layers from the MLP lab as well as at least one LSTM unit.
Don't forget the activation for the linear layers.

We need to predict only one value so the output should only have one dimension.

In [None]:
## TODO : implement an RNN model that has Linear layers and LSTM units.
def clip(value, clip_threshold=10): 
    """You can use gradient clipping like for the CNN lab.
    Although it is likely not useful here."""
    return value

class RNNModel :
    """Implements a full RNN model, with at least one LSTM unit.
    Take inspiration from the full MLP and the full CNN model.
    """
    def __init__(self, input_dim, lstm_dim, mlp_description) -> None:
        """Initializes self.layers using the descriptions.
        Start with the lstm unit and then the linear part. 
        Don't forget activations (relu) for the linear part."""

    def forward(self, x):
        return x
    
    def backward(self, gradient):
        return gradient
    
    def step(self, alpha):
        pass

    def __call__(self, x):
        return self.forward(x)

## Training the model.

We will now train our model. This time our problem is a regression problem, because we are trying to predict a real value. We will thus use the same loss function as in lab 1 (Perceptron to solve OR), the L2 loss:

$$ \mathcal{L}(\hat{y}, y) = \frac{1}{2}(\hat{y}-y)^2$$

Whose derivative is : 

$$ \frac{d\mathcal{L}}{d\hat{y}} = \hat{y} - y $$

In [None]:
##TODO : implement the L2-loss, you can reuse lab 1 implementation
def l2_loss(y_pred, y_true):
    return 0

In [None]:
# TODO : implement the training loop with the validation loop
def validation(model, val_dataset):
    """Reuse previous lab implementation.
    Be mindful the loss has changed."""

def train(model, train_set, lr, num_epochs, val_set):
    losses = []
    best_model = None
    best_model_loss = float("inf")
    ### TODO : reuse previous labs implementation
    ### be mindful the loss (and its derivative) has changed
    ### Don't forget the validation loop
    return losses, best_model


In [None]:
## TODO : train your model. The architecture below should give you good results
## if the implementation is correct. Any deeper architecture might lead to vanishing gradients,
## which would require more optimization that is out of the scope of this lab
model = RNNModel(3, 100, [1]) 
lr = 0.001
num_epochs = 200


## Evaluation

This time it is not a classification problem, so there is no accuracy or recall to compute. Instead we will compute the Root Mean Squared Error (RMSE) of the model over the test set and visualize the predictions against the true values.

The RMSE is given by : 
$$ RMSE = \sqrt{\sum_{i=1}^n \frac{(y_i - \hat{y}_i)^2}{n}} $$
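
As a quick illustration of the RMSE formula, here is a sketch on toy values. The function name `rmse` and the arrays are hypothetical and are not tied to the model's actual predictions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two arrays of the same length."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])
error = rmse(y_true, y_pred)
print(error)  # 1.0, since the only squared error is 4 and 4 / 4 = 1
```

If the RMSE is computed on min-max scaled values, multiplying it by $\max - \min$ of the target gives the error in the original units.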



In [None]:
## TODO: Compute the RMSE of the model over the test set

In [None]:
# TODO : Show the predicted values and the test values on the same graph

BONUS : Repeat the experiment for each possible feature of the dataset, changing the target each time. Plot the results every time.