# Building a Recurrent Neural Network

In this project, I will implement key components of a Recurrent Neural Network using numpy, to generate customer reviews.

I credit the content of this notebook to:
- Prof. Jeffrey Stanton whose HW and Lab notebooks have inspired this notebook on Text Generation. I have borrowed the theme and some of the code below from his notebook 'StantonRNNtextGen' from Week9 of the course IST664. 
- Prof. Andrew Ng whose 'Deep Learning Specialization' on Coursera has laid the foundation for my understanding of the mathematics behind the RNN model I've used below.

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
import string

# TASK AT HAND

Given a large dataset of short product reviews, our model aims to generate a user review from the start of a sample review. We will use a recurrent neural network for character-level text generation.

# RECURRENT NEURAL NETWORK

<img src="images/RNN.png" style="width:500;height:300px;">
<caption><center> **Figure 1**: Basic RNN model </center></caption>

**Recurrent Neural Networks (RNNs)** are a class of neural networks which are very useful in Natural Language Processing and other sequence tasks because they have "memory". They are able to analyze inputs $x^{\langle t \rangle}$ individually, and remember some context which gets transferred from one time-step to the next. This allows a unidirectional RNN to transfer the context from the past into future inputs.

## Understanding our Variables
______
**Notation used**:
- Superscript $[l]$ denotes an object associated with the $l^{th}$ layer. 

- Superscript $(i)$ denotes an object associated with the $i^{th}$ example. 

- Superscript $\langle t \rangle$ denotes an object at the $t^{th}$ time-step. 
    
- **Sub**script $i$ denotes the $i^{th}$ entry of a vector.

Example:  
- $a^{(4)[3]<2>}_1$ denotes the activation of the 4th training example (4), 3rd layer [3], 2nd time step <2>, and 1st entry in the vector.
___________
### Input $x$
#### Dimensions 

* Input with $len_x$ possible characters
    - For a single timestep of a single input example, $x^{(i) \langle t \rangle }$ is a vector with shape ($len_x$,).
    - Here $len_x$ denotes the number of units in a single timestep of a single training example.
    - For example, in a language with a 100 word vocab_size $x^{(i) \langle t \rangle }$ could be one-hot encoded into a vector with shape (100,).
* Time steps of size $T_{x}$
* Mini-batches of size $m$
    - The inputs are separated into mini-batches to benefit from vectorization.
    - Each mini-batch has the shape $(len_x,m,T_x)$

**Resultant dimension of input $x$**: $(len_{x},m,T_{x})$

The variable name for the 3D tensor $x$ in the code is `x`.<br>
The variable name for the 2D slice of $x$, referred to as $x^{\langle t \rangle}$, in the code is `xt`.

**EXAMPLE:** With a vocabulary of size 100 characters, the sentence "This is an example" (18 characters, counting spaces) is of the shape $(100, 18)$. We can give our model the input "This " in the first time-step, which gives the first mini-batch the shape $(100, 5)$.<br>
In the second time-step, the mini-batch will be "his i" and will also have the shape $(100, 5)$.
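
To make these shapes concrete, here is a minimal sketch that one-hot encodes a short string. The toy vocabulary (built from the sentence itself rather than from 100 characters) and the helper `one_hot_encode` are assumptions made for illustration; they are not part of this notebook's pipeline.

```python
import numpy as np

# Hypothetical toy vocabulary built from the example sentence itself
sentence = "this is an example"
toy_vocab = sorted(set(sentence))
toy_char_to_idx = {ch: i for i, ch in enumerate(toy_vocab)}

def one_hot_encode(text, char_to_idx):
    """Return an array of shape (vocab_size, len(text)) with a single 1 per column."""
    encoded = np.zeros((len(char_to_idx), len(text)))
    for t, ch in enumerate(text):
        encoded[char_to_idx[ch], t] = 1
    return encoded

encoded_sentence = one_hot_encode(sentence, toy_char_to_idx)
encoded_window = one_hot_encode(sentence[:5], toy_char_to_idx)

print(encoded_sentence.shape)        # (vocabulary size, number of characters)
print(encoded_window.shape)          # (vocabulary size, 5)
print(encoded_sentence.sum(axis=0))  # every column sums to 1
```

Each column is the one-hot vector for one character, so slicing columns gives the shorter windows described in the example.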

### Hidden State $a$

* The activation $a^{\langle t \rangle}$ that is passed from one RNN time step to another is called a "hidden state."

#### Dimensions

* Similar to the input tensor $x$, the hidden state for a single training example is a vector of length $len_{a}$, and a mini-batch contains $m$ training examples.
* Therefore, the shape of the hidden state is $(len_{a}, m, T_x)$ when we take into consideration the time steps.
* At each time step with index $t$, we work with a 2D slice of the 3D tensor. We'll refer to this 2D mini-batch slice as $a^{\langle t \rangle}$, which has a shape of $(len_{a},m)$
* In the code, the variable names we use are either `a_prev`, for the hidden activations from the previous time step, or `a_next`, for the hidden activations passed to the next time step.

### Prediction $\hat{y}$

- The inputs and the hidden states are used to make predictions of the specified output variable, represented as $\hat{y}$, while using the known output variable values to minimize a loss function.

#### Dimensions
* Similar to the inputs and hidden states, $\hat{y}$ is a 3D tensor of shape $(len_{y}, m, T_{y})$.
    * $len_{y}$: number of units in the vector representing the prediction.
    * $m$: number of examples in a mini-batch.
    * $T_{y}$: number of time steps in the prediction.
* For a single time step $t$, a 2D slice $\hat{y}^{\langle t \rangle}$ has shape $(len_{y}, m)$.
* In the code, the variable names are:
    - `y_pred`: $\hat{y}$ 
    - `yt_pred`: $\hat{y}^{\langle t \rangle}$

**EXAMPLE:** With a vocabulary of size 100 characters, the sentence "This is an example" (18 characters, counting spaces) is of the shape $(100, 18)$. We can give our model the input "This " in the first time-step, for which the expected output is "his i". Therefore, the first input and output mini-batches have the shape $(100, 5)$.<br>
In the second time-step, the input mini-batch will be "his i" and the output mini-batch will be "is is". Both will have the shape $(100, 5)$.

### Weights ($W_{ij}$) and Biases ($b_i$)

There are three weight matrices in an RNN, and they are shared across timesteps. This means that the same weights and biases are applied at every timestep, and training updates this single set of parameters to best predict the output sequence.

- $W_{ax}$:<br>
    This is the weight matrix that parametrizes the connection from the input layer to the hidden layer.<br>
    **Dimensions**: $(len_a, len_x)$, i.e. (number of hidden units, vocabulary size)<br>
    **Variable Name**: `Weight_ax`<br><br>
- $W_{aa}$:<br>
    This is the weight matrix that parametrizes the connection from the hidden state at one time step to the hidden state at the next.<br>
    **Dimensions**: $(len_a, len_a)$, i.e. (number of hidden units, number of hidden units)<br>
    **Variable Name**: `Weight_aa`<br><br>
- $W_{ya}$:<br>
    This is the weight matrix that parametrizes the connection from the hidden layer to the output layer.<br>
    **Dimensions**: $(len_y, len_a)$, i.e. (vocabulary size, number of hidden units)<br>
    **Variable Name**: `Weight_ya`<br>

We have two bias terms too, which are
- $b_{a}$:<br>
    Bias term for the calculation of the value of the hidden layer.<br>
    **Dimensions**: $(len_a, 1)$, i.e. (number of hidden units, 1)<br>
    **Variable Name**: `bias_a`<br>
- $b_{y}$:<br>
    Bias term for the calculation of the value of the output layer.<br>
    **Dimensions**: $(len_y, 1)$, i.e. (vocabulary size, 1)<br>
    **Variable Name**: `bias_y`<br>
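
The parameter shapes listed above can be checked with a minimal sketch. The sizes (`len_a = 4`, `len_x = len_y = 6`, `m = 1`) are arbitrary values chosen for illustration, and zero-filled arrays stand in for real parameters.

```python
import numpy as np

len_a, len_x, len_y, m = 4, 6, 6, 1   # illustrative sizes, not the notebook's

Weight_ax = np.zeros((len_a, len_x))   # input  -> hidden
Weight_aa = np.zeros((len_a, len_a))   # hidden -> hidden
Weight_ya = np.zeros((len_y, len_a))   # hidden -> output
bias_a = np.zeros((len_a, 1))
bias_y = np.zeros((len_y, 1))

xt = np.zeros((len_x, m))       # input for one time step
a_prev = np.zeros((len_a, m))   # hidden state from the previous time step

# The matrix products are only defined because the inner dimensions agree
a_next = np.tanh(Weight_aa @ a_prev + Weight_ax @ xt + bias_a)
scores = Weight_ya @ a_next + bias_y

print(a_next.shape)   # (len_a, m)
print(scores.shape)   # (len_y, m)
```

The biases have a trailing dimension of 1 so that NumPy broadcasting adds them to every column of the mini-batch.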

## A - Overview of the model

- Create the parameters and initialize them
- Process the optimization loop
    - Forward propagation to compute the loss function
    - Backward propagation to compute the gradients of the loss function wrt the parameters
    - Clip the gradients to avoid exploding gradients
    - Using the gradients, update your parameters with the gradient descent update rule.
- Return the learned parameters 
    
<img src="images/rnn.png" style="width:450;height:300px;">
<caption><center> **Figure 2** </center></caption>

* At each time-step, our RNN tries to predict what is the next character given the previous characters.
* The dataset $\mathbf{X} = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, ..., x^{\langle T_x \rangle})$ is a list of characters in the training set.
* $\mathbf{Y} = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, ..., y^{\langle T_x \rangle})$ is a similar list of characters but shifted one character forward.
* At every time-step $t$, $y^{\langle t \rangle} = x^{\langle t+1 \rangle}$.  In other words, the target at time $t$ is identical to the input at time $t + 1$.
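
The relationship between $\mathbf{X}$ and $\mathbf{Y}$ can be sketched as follows. The toy review, the toy vocabulary, the use of `None` as a placeholder for the first input and the newline as the final target are assumptions for illustration (the forward pass later in this notebook does treat a `None` input as a zero vector).

```python
# Hypothetical toy example: build input and target index sequences for one review
review = "great"
toy_vocab = sorted(set(review)) + ["\n"]
toy_char_to_idx = {ch: i for i, ch in enumerate(toy_vocab)}

indices = [toy_char_to_idx[ch] for ch in review]

# Input: a placeholder for the first time step, followed by the characters
X = [None] + indices
# Target: the same characters shifted one step forward, ending with a newline
Y = indices + [toy_char_to_idx["\n"]]

# At every time step t, the target y<t> equals the next input x<t+1>
for t in range(len(X) - 1):
    assert Y[t] == X[t + 1]

print(X)
print(Y)
```

Both lists have the same length, so the network makes exactly one prediction per input character.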

## B - Building blocks of the model

Here we shall build the important components of an overall RNN model's Optimization Process:
- Forward Propagation Functions
- Backward Propagation Functions
- Gradient clipping: to avoid exploding gradients
- Calculate Loss

## 1. Forward Propagation

1. Implement the function needed for one time-step of the RNN cell
2. Implement a loop over $T_x$ time-steps in order to process the whole input sequence, one time-step at a time

### Step 1 -  RNN Cell Forward Propagation

An RNN is essentially the repetition of a single cell.

<img src="images/rnn_step_forward_figure2_v3a.png" style="width:700px;height:300px;">
<caption><center> **Figure 3**: Basic RNN cell. Takes as input $x^{\langle t \rangle}$ (current input) and $a^{\langle t - 1\rangle}$ (context from the previous states), and outputs $a^{\langle t \rangle}$ which is the context given to the next RNN cell and also used to predict $\hat{y}^{\langle t \rangle}$ </center></caption>

1. Compute the hidden state with tanh activation: $a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)$.
2. Using your new hidden state $a^{\langle t \rangle}$, compute the prediction $\hat{y}^{\langle t \rangle} = softmax(W_{ya} a^{\langle t \rangle} + b_y)$. I've used the function for `softmax` separately with the help of a blog post (link given in REFERENCE section below) $^1$
3. Return the latest hidden state $a^{\langle t \rangle}$ and prediction $\hat{y}^{\langle t \rangle}$

#### NOTE:

I am initializing the weights randomly in the interval from **[ -1/sqrt(len), 1/sqrt(len)]**, where len is the number of connections incoming from the preceding layer

In [16]:
def create_params(len_a, len_x, len_y):
    """
    Initialize params with small random numbers for values
    """
    np.random.seed(6)
    # weight for data from input to hidden
    Weight_ax = np.random.uniform(-np.sqrt(1./len_x),np.sqrt(1./len_x),(len_a,len_x))
    
    # weight for data from hidden to hidden
    Weight_aa = np.random.uniform(-np.sqrt(1./len_a),np.sqrt(1./len_a),(len_a,len_a))
    
    # weight for data from hidden to output
    Weight_ya = np.random.uniform(-np.sqrt(1./len_a),np.sqrt(1./len_a),(len_y,len_a))
    
    # hidden layer bias
    bias_a = np.zeros((len_a, 1))
    
    # output layer bias
    bias_y = np.zeros((len_y, 1))
    
    params = {"Weight_ax": Weight_ax, "Weight_aa": Weight_aa, "Weight_ya": Weight_ya, "bias_a": bias_a,"bias_y": bias_y}
    
    return params

In [17]:
def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)
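
As a quick sanity check, the sketch below repeats the `softmax` definition from the cell above (so that the snippet is self-contained) and confirms that the outputs form a probability distribution even for very large scores. The score values are arbitrary and chosen only to show why the maximum is subtracted before exponentiating.

```python
import numpy as np

def softmax(x):
    # Same definition as in the notebook cell, repeated to keep this snippet self-contained
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum(axis=0)

scores = np.array([[1000.0], [1001.0], [1002.0]])  # illustrative large scores
probs = softmax(scores)

print(probs.ravel())
print(probs.sum())   # sums to 1 up to floating-point error

# Without subtracting the maximum, np.exp(1000.0) would overflow to inf
# and the division would produce nan values.
```

Subtracting a constant from every score does not change the result of softmax, so the shift only affects numerical stability.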

In [18]:
def rnn_cell_fwd_propagation(xt, a_prev, params):
    """
    single forward step of the RNN-cell
    """
    
    # Retrieve params from "params"
    Weight_ax = params["Weight_ax"]
    Weight_aa = params["Weight_aa"]
    Weight_ya = params["Weight_ya"]
    bias_a = params["bias_a"]
    bias_y = params["bias_y"]
    
    # compute next activation state using the formula given in Step1.1 above
    a_next = np.tanh(np.dot(Weight_aa,a_prev) + np.dot(Weight_ax,xt) + bias_a)
    
    # compute output of the current cell using the formula given in Step1.2 above
    # it gives us the probability of the next character
    yt_pred = softmax(np.dot(Weight_ya,a_next)+bias_y)
    
    # Step 1.3
    return a_next, yt_pred

### Step 2 - RNN forward pass 

- A recurrent neural network (RNN) forward pass is a repetition of the RNN cell forward propagation that we've just built. 


<img src="images/rnn_forward_sequence_figure3_v3a.png" style="width:800px;height:180px;">
<caption><center> **Figure 4**: Basic RNN. The input sequence $x = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, ..., x^{\langle T_x \rangle})$  is carried over $T_x$ time steps. The network outputs $y = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, ..., y^{\langle T_x \rangle})$. </center></caption>


**Step 2A - Initialize Variables**
* Create empty dictionaries to represent input $x$, hidden state $a$ and predictions $\hat{y}_{pred}$
* Add the base hidden state to the overall hidden state vector $a$, setting it equal to the initial hidden state, $a_{0}$.

**Step 2B - Loop**
* At each time step $t$:
    1. Get $x^{\langle t \rangle}$ and set it as a 2D slice of $x$ for the time step $t$
    2. Run `rnn_cell_fwd_propagation` on $x^{\langle t \rangle}$ and the previous hidden state $a^{\langle t-1 \rangle}$ to get the updated hidden state $a^{\langle t \rangle}$, which is passed on to the next time step, and the prediction $\hat{y}^{\langle t \rangle}$ for the next most likely character.

**Step 2C - Return the updated values**
* Return the dictionaries $x$, $a$ and $\hat{y}_{pred}$

In [19]:
def rnn_fwd_propagation(X, Y_actual, a_prev, params, vocab_size):
    """
    forward propagation through an entire RNN
    """
    # Step 2A
    # Initialize x, a and y_hat as empty dictionaries
    x, a, y_pred = {}, {}, {}
    
    # Add the base hidden state a0, contained in a_prev, under the key -1
    # so that a[t-1] is defined for the first time step (t = 0)
    a[-1] = np.copy(a_prev)
    
    # Step 2B
    # Loop over all the time steps of the input
    for t in range(len(X)):
        
        # Set x[t] to be the one-hot vector representation of the t'th character in X.
        # If X[t] is None, x[t] stays the zero vector. This is used to set the input for the first timestep to the zero vector. 
        x[t] = np.zeros((vocab_size,1)) 
        if X[t] is not None:
            x[t][X[t]] = 1
        
        # Run one step forward of the RNN Cell giving it the input x at timestep t,
        # the hidden state at timestep (t-1) and the weights in the params variable
        a[t], y_pred[t] = rnn_cell_fwd_propagation(x[t], a[t-1], params)
    
    # Step 2C
    return x, a, y_pred

## 2. Backward Propagation

### Step 1 - Basic RNN  backward pass

We will start by computing the backward pass for the basic RNN-cell and then in the following sections, iterate through the cells.

<img src="images/rnn_backward_overview_3a_1.png" style="width:500;height:300px;"> <br>
<caption><center> **Figure 5**: RNN-cell's backward pass where the derivative of the cost function $J$ backpropagates through the RNN by following the chain-rule.
    
The Chain Rule is also used to calculate $(\frac{\partial J}{\partial W_{ax}},\frac{\partial J}{\partial W_{aa}},\frac{\partial J}{\partial b_a})$ to update the params $(W_{ax}, W_{aa}, b_a)$. The operation will utilize the cached results from the forward propagation. </center></caption>

**NOTE:** the variable representation for the partial derivative of cost relative to a `variable` is `delta_Variable`. For example, $\frac{\partial J}{\partial W_{ax}}$ is $dW_{ax}$ and this is represented as `delta_Weight_ax`. This is used throughout the remaining sections.


<img src="images/rnn_cell_backward_3a_4.png" style="width:500;height:300px;"> <br>
<caption><center> **Figure 6**: This implementation of rnn_cell_backward does not include the output dense layer and softmax which are included in rnn_cell_fwd_propagation. </center></caption>

##### Equations
To compute the `rnn_cell_bwd_propagation` we utilize the following equations. Here, $*$ denotes element-wise multiplication while the absence of a symbol indicates matrix multiplication.

$a^{\langle t \rangle} = \tanh(W_{ax} x^{\langle t \rangle} + W_{aa} a^{\langle t-1 \rangle} + b_{a})\tag{alpha}$
 
$\displaystyle \frac{\partial \tanh(x)} {\partial x} = 1 - \tanh^2(x) \tag{beta}$
 
$\displaystyle db_a = \sum_{batch} \left( da_{next} * \left( 1-\tanh^2(W_{ax}x^{\langle t \rangle}+W_{aa} a^{\langle t-1 \rangle} + b_{a}) \right) \right)\tag{1}$

$\displaystyle dW_{aa} = \left( da_{next} * \left( 1-\tanh^2(W_{ax}x^{\langle t \rangle}+W_{aa} a^{\langle t-1 \rangle} + b_{a}) \right) \right) a^{\langle t-1 \rangle T}\tag{2}$

$\displaystyle dW_{ax} = \left( da_{next} * \left( 1-\tanh^2(W_{ax}x^{\langle t \rangle}+W_{aa} a^{\langle t-1 \rangle} + b_{a}) \right) \right) x^{\langle t \rangle T}\tag{3}$
 
$\displaystyle da_{prev} = { W_{aa}}^T \left( da_{next} * \left( 1-\tanh^2(W_{ax}x^{\langle t \rangle}+W_{aa} a^{\langle t-1 \rangle} + b_{a}) \right) \right)\tag{4}$
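
The weight gradient $dW_{aa}$ can be checked numerically. The sketch below is a minimal gradient check under stated assumptions: the sizes, the random values and the scalar `surrogate_loss` (constructed so that its gradient with respect to $a^{\langle t \rangle}$ equals a fixed `da_next`) are illustrative and not part of this notebook. It compares the analytic form, an element-wise product with $1-\tanh^2$ followed by a matrix product with $a^{\langle t-1 \rangle T}$, against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
len_a, len_x = 3, 4   # illustrative sizes

W_ax = rng.normal(size=(len_a, len_x))
W_aa = rng.normal(size=(len_a, len_a))
b_a = rng.normal(size=(len_a, 1))
xt = rng.normal(size=(len_x, 1))
a_prev = rng.normal(size=(len_a, 1))
da_next = rng.normal(size=(len_a, 1))   # stand-in for the upstream gradient

def cell(W_aa_value):
    """Hidden state a<t> as a function of the hidden-to-hidden weights."""
    return np.tanh(W_ax @ xt + W_aa_value @ a_prev + b_a)

def surrogate_loss(W_aa_value):
    # A scalar whose gradient with respect to a<t> is exactly da_next
    return float(np.sum(da_next * cell(W_aa_value)))

# Analytic gradient: element-wise product with 1 - tanh^2, then a matrix product
a_next = cell(W_aa)
dz = da_next * (1 - a_next ** 2)
dW_aa_analytic = dz @ a_prev.T

# Numerical gradient by central differences, one weight at a time
eps = 1e-6
dW_aa_numeric = np.zeros_like(W_aa)
for i in range(len_a):
    for j in range(len_a):
        W_plus, W_minus = W_aa.copy(), W_aa.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        dW_aa_numeric[i, j] = (surrogate_loss(W_plus) - surrogate_loss(W_minus)) / (2 * eps)

max_abs_diff = np.max(np.abs(dW_aa_analytic - dW_aa_numeric))
print(max_abs_diff)   # should be very small if the analytic form is right
```

The same pattern can be used to check the gradients with respect to $W_{ax}$, $b_a$ and $a^{\langle t-1 \rangle}$.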

In [20]:
def rnn_cell_bwd_propagation(delta_y, grads, params, xt, at, a_tmin1):
    """
    The backward pass for the RNN-cell (single time-step).
    """
    
    # Retrieve values of weights and biases
    Weight_ax = params["Weight_ax"]
    Weight_aa = params["Weight_aa"]
    Weight_ya = params["Weight_ya"]
    bias_a = params["bias_a"]
    bias_y = params["bias_y"]
    
    # NOTE: .T is the transpose of the matrix it follows
    # We now move from the output layer towards the input layer
    # during backpropagation
    
    # gradient of the loss wrt Weight_ya between the output layer
    # and the last hidden layer, using the difference between 
    # y_pred and y_actual (contained in delta_y)
    grads['delta_Weight_ya'] += np.dot(delta_y, at.T)
    grads['delta_bias_y'] += delta_y
    
    # Gradient of the loss wrt the hidden state a<t>: the contribution
    # from the output layer plus the gradient flowing back from
    # time step t+1 (stored in grads['delta_a_next'])
    delta_a = np.dot(Weight_ya.T, delta_y) + grads['delta_a_next']
    # Gradient of the loss wrt z, the argument of tanh() in Equation
    # 'alpha': multiply delta_a element-wise by the tanh derivative
    # from Equation 'beta', using at = tanh(z)
    delta_z = (1 - at*at)*delta_a
    # Gradient of the loss wrt the hidden bias (Equation 1 above)
    grads['delta_bias_a'] += delta_z
    
    # compute the gradient of the loss wrt Weight of hidden states
    # (Equation 2 above)
    grads['delta_Weight_aa'] += np.dot(delta_z, a_tmin1.T)
    
    # gradient of the loss wrt Weight of input layer and hidden state
    # (Equation 3 above)
    grads['delta_Weight_ax'] += np.dot(delta_z, xt.T)
    
    # gradient of the loss wrt the previous hidden state, which is
    # passed back to time step t-1 (Equation 4 above)
    grads['delta_a_next'] = np.dot(Weight_aa.T, delta_z)
    
    return grads

### Step 2 - Backward pass through the RNN

To backpropagate the gradient through the network, we first create the variables, then iterate through all the time steps starting at the end, and at each step, we increment the overall $db_i$ and $dW_{ij}$ and store them.

In [21]:
def rnn_bwd_propagation(x, a, y_pred, X, Y_actual, params):
    """
    The backward pass for a RNN over an entire sequence of input
    """
    
    # Initialize gradients as an empty dictionary
    grads = {}
    
    # Retrieve from cache and params
    Weight_ax, Weight_aa, Weight_ya, bias_a, bias_y = params["Weight_ax"], params["Weight_aa"], params["Weight_ya"], params["bias_a"], params["bias_y"]
    
    # each gradient should be initialized to zeros of the same dimension as its corresponding parameter
    grads['delta_Weight_ax'], grads['delta_Weight_aa'], grads['delta_Weight_ya'] = np.zeros_like(Weight_ax), np.zeros_like(Weight_aa), np.zeros_like(Weight_ya)
    grads['delta_bias_a'], grads['delta_bias_y'] = np.zeros_like(bias_a), np.zeros_like(bias_y)
    grads['delta_a_next'] = np.zeros_like(a[0])
    
    # Backpropagate through time
    for t in reversed(range(len(X))):
        delta_y = np.copy(y_pred[t])
        #through softmax backpropagate into y
        delta_y[Y_actual[t]] -= 1
        # Run one iteration of backward propagation for a time step
        grads = rnn_cell_bwd_propagation(delta_y, grads, params, x[t], a[t], a[t-1])
    
    return grads
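
The line `delta_y[Y_actual[t]] -= 1` in the cell above relies on a standard result: for a softmax output with cross-entropy loss, the gradient with respect to the pre-softmax scores is the predicted distribution minus the one-hot target. A minimal numerical check follows; the scores and the target index are arbitrary illustrative values.

```python
import numpy as np

def softmax(z):
    e_z = np.exp(z - np.max(z))
    return e_z / e_z.sum(axis=0)

def cross_entropy(z, target):
    """Cross-entropy loss of softmax(z) for a single target index."""
    return float(-np.log(softmax(z)[target, 0]))

z = np.array([[0.5], [-1.0], [2.0], [0.1]])   # illustrative scores
target = 2                                     # illustrative target index

# Analytic gradient: predicted probabilities minus the one-hot target
grad_analytic = softmax(z)
grad_analytic[target] -= 1

# Numerical gradient by central differences
eps = 1e-6
grad_numeric = np.zeros_like(z)
for k in range(z.shape[0]):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[k] += eps
    z_minus[k] -= eps
    grad_numeric[k] = (cross_entropy(z_plus, target) - cross_entropy(z_minus, target)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))   # should be very small
```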

## 3 - Clipping the gradients during optimization

### Exploding gradients
* When gradients become very large, they are known as "exploding gradients."
* Exploding gradients make training harder because the parameter updates end up being so large that they "overshoot" the optimal values during gradient descent.


### Gradient Clipping
Our function `gradient_clipper` will take a dictionary of gradients and will clip them if needed and return the result.
* We will clip every element of the gradient vector so that it lies in the range [-N, N].
* Any component greater than N is set to N, and any component less than -N is set to -N.
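
A minimal sketch of element-wise clipping with `np.clip`, the same NumPy call used in `gradient_clipper` below. The array values and the boundary of 5 are illustrative.

```python
import numpy as np

toy_gradient = np.array([[-12.0, 0.3], [4.9, 7.5]])   # illustrative values
boundary = 5

# out=toy_gradient makes np.clip write the result back into the same array
np.clip(toy_gradient, -boundary, boundary, out=toy_gradient)

# Values outside [-5, 5] are replaced by the nearest bound; the rest are untouched
print(toy_gradient)
```

Note that this clips each component independently, which can change the direction of the gradient, unlike rescaling the whole gradient by its norm.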

<img src="images/clip.png" style="width:400;height:150px;">
<caption><center> **Figure 7**: With an exploding gradient problem, gradient descent might not be able to home in on the minimum. </center></caption>

In [22]:
def gradient_clipper(grads, boundary):
    '''
    Clips the gradients' values between the -boundary (minimum)
    and the +boundary (maximum).
    '''
    
    delta_Weight_aa, delta_Weight_ax, delta_Weight_ya, delta_bias_a, delta_bias_y = grads['delta_Weight_aa'], grads['delta_Weight_ax'], grads['delta_Weight_ya'], grads['delta_bias_a'], grads['delta_bias_y']
    
    # Loop through all the gradients
    for gradient in [delta_Weight_ax, delta_Weight_aa, delta_Weight_ya, delta_bias_a, delta_bias_y]:
        np.clip(gradient, -boundary, boundary, out=gradient)
    
    grads = {"delta_Weight_aa": delta_Weight_aa, "delta_Weight_ax": delta_Weight_ax, "delta_Weight_ya": delta_Weight_ya, "delta_bias_a": delta_bias_a, "delta_bias_y": delta_bias_y}
    
    return grads

## 4 - Calculating Loss

In [23]:
def calculate_loss(X, Y_actual, y_pred):
    """
    Calculate the loss for a sequence
    """
    # calculate cross-entropy loss, summed over all time steps
    return sum(-np.log(y_pred[t][Y_actual[t],0]) for t in range(len(X)))
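
A useful reference point for this loss is the value obtained when the model assigns equal probability to every character, which is $T \cdot \ln(\text{vocab size})$ for a sequence of $T$ characters. The sketch below repeats the loss function so that it is self-contained; the input and target indices are made up for illustration.

```python
import numpy as np

def calculate_loss(X, Y_actual, y_pred):
    # Same cross-entropy sum as the notebook cell, repeated for self-containment
    return sum(-np.log(y_pred[t][Y_actual[t], 0]) for t in range(len(X)))

vocab_size = 38                  # vocabulary size reported later in this notebook
X = [None, 3, 7, 1]              # illustrative input indices
Y_actual = [3, 7, 1, 0]          # illustrative targets (inputs shifted by one)

# A model that assigns equal probability to every character
uniform = np.full((vocab_size, 1), 1.0 / vocab_size)
y_pred = {t: uniform for t in range(len(X))}

baseline_loss = calculate_loss(X, Y_actual, y_pred)
expected_baseline = len(X) * np.log(vocab_size)
print(baseline_loss, expected_baseline)   # the two values agree
```

A trained model should reach a per-character loss well below $\ln(38)$; a value close to it suggests the model has learned little beyond guessing.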

## C - Building the language model 

### Gradient descent 

* We will build a function to perform a single step of Stochastic Gradient Descent (with clipped gradients). 

The steps of the `optimization` loop for an RNN:

- Forward propagating through the RNN and calculating the loss
- Backward propagating through time and calculating the gradients of the loss wrt the weights and biases
- Clipping these gradients
- Updating the weights and biases through gradient descent 

#### Parameters

* The weights and biases are stored inside the `params` dictionary and are updated in place ("pass by reference").
* Python passes a reference to the dictionary into the function, so modifying its arrays with `+=` inside the function changes the original object, and the updated values do not strictly have to be returned as an output of the function.
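
The difference between modifying an object in place and rebinding a local name can be seen in a minimal sketch. The function names `update_in_place` and `rebind_only` and the toy values are hypothetical and exist only for this illustration.

```python
import numpy as np

def update_in_place(params, grads, learning_rate):
    # += on a NumPy array modifies the existing array object
    params["w"] += -learning_rate * grads["w"]

def rebind_only(params, grads, learning_rate):
    # Rebinding the local name creates a new dictionary and leaves the caller's untouched
    params = {"w": params["w"] - learning_rate * grads["w"]}

toy_params = {"w": np.array([1.0, 2.0])}
toy_grads = {"w": np.array([10.0, 10.0])}

rebind_only(toy_params, toy_grads, 0.1)
after_rebind = toy_params["w"].copy()
print(after_rebind)        # unchanged

update_in_place(toy_params, toy_grads, 0.1)
print(toy_params["w"])     # updated without returning anything
```

This is why `update_params` below can rely on `+=` to change the caller's dictionary, even though it also returns `params` for convenience.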

In [24]:
def update_params(params, grads, learning_rate):

    params['Weight_ax'] += -learning_rate * grads['delta_Weight_ax']
    params['Weight_aa'] += -learning_rate * grads['delta_Weight_aa']
    params['Weight_ya'] += -learning_rate * grads['delta_Weight_ya']
    params['bias_a']  += -learning_rate * grads['delta_bias_a']
    params['bias_y']  += -learning_rate * grads['delta_bias_y']
    return params

In [25]:
def optimization(X, Y_actual, a_prev, params, vocab_size, learning_rate):
    """
    Execute one step of the optimization to train the model.
    """
    
    # Forward propagation
    x, a, y_pred = rnn_fwd_propagation(X, Y_actual, a_prev, params, vocab_size)
    
    # Backward propagation
    grads = rnn_bwd_propagation(x, a, y_pred, X, Y_actual, params)
    
    # Gradient clipping between -5 (min) and 5 (max)
    grads = gradient_clipper(grads, 5)
    
    # Calculate Loss
    loss = calculate_loss(X, Y_actual, y_pred)
    
    # Update the weights and biases (params is also modified in place)
    params = update_params(params, grads, learning_rate)
    
    return loss, grads, a[len(X)-1]

## D - Creating the Input Data

In [7]:
# Download Review Data from the following URL "https://www.dropbox.com/s/5t7udbxdua3xu2r/1429_1.csv?dl=1"
db_base = "1429_1.csv"

review_data = pd.read_csv(db_base)
review_data.head()



Unnamed: 0,id,name,asins,brand,categories,keys,manufacturer,reviews.date,reviews.dateAdded,reviews.dateSeen,...,reviews.doRecommend,reviews.id,reviews.numHelpful,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.userCity,reviews.userProvince,reviews.username
0,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,This product so far has not disappointed. My c...,Kindle,,,Adapter
1,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,great for beginner or experienced person. Boug...,very fast,,,truman
2,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,Inexpensive tablet for him to use and learn on...,Beginner tablet for our 9 year old son.,,,DaveZ
3,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,4.0,http://reviews.bestbuy.com/3545/5620406/review...,I've had my Fire HD 8 two weeks now and I love...,Good!!!,,,Shacks
4,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-12T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,I bought this for my grand daughter when she c...,Fantastic Tablet for kids,,,explore42


The input to the model will be a list of all the reviews on which the model will be trained. This is saved in the `model_input` variable

In [8]:
model_input = review_data['reviews.text'].dropna().to_list()
model_input

['This product so far has not disappointed. My children love to use it and I like the ability to monitor control what content they see with ease.',
 'great for beginner or experienced person. Bought as a gift and she loves it',
 'Inexpensive tablet for him to use and learn on, step up from the NABI. He was thrilled with it, learn how to Skype on it already...',
 "I've had my Fire HD 8 two weeks now and I love it. This tablet is a great value.We are Prime Members and that is where this tablet SHINES. I love being able to easily access all of the Prime content as well as movies you can download and watch laterThis has a 1280/800 screen which has some really nice look to it its nice and crisp and very bright infact it is brighter then the ipad pro costing $900 base model. The build on this fire is INSANELY AWESOME running at only 7.7mm thick and the smooth glossy feel on the back it is really amazing to hold its like the futuristic tab in ur hands.",
 'I bought this for my grand daughter 

Let us clean `model_input` by removing punctuation and non-ASCII characters (such as emojis) and converting all the letters to lowercase.

In [9]:
def clean_text(txt):
    txt = "".join(v for v in txt if v not in string.punctuation).lower()
    txt = txt.encode("utf8").decode("ascii", "ignore")
    return txt

# clean reviews, change them to lowercase and remove '\x7f' terms
model_input = [clean_text(text).lower() for text in model_input if '\x7f' not in text]
model_input

['this product so far has not disappointed my children love to use it and i like the ability to monitor control what content they see with ease',
 'great for beginner or experienced person bought as a gift and she loves it',
 'inexpensive tablet for him to use and learn on step up from the nabi he was thrilled with it learn how to skype on it already',
 'ive had my fire hd 8 two weeks now and i love it this tablet is a great valuewe are prime members and that is where this tablet shines i love being able to easily access all of the prime content as well as movies you can download and watch laterthis has a 1280800 screen which has some really nice look to it its nice and crisp and very bright infact it is brighter then the ipad pro costing 900 base model the build on this fire is insanely awesome running at only 77mm thick and the smooth glossy feel on the back it is really amazing to hold its like the futuristic tab in ur hands',
 'i bought this for my grand daughter when she comes ove

Let's now build the vocabulary for all these reviews and save the `vocab_size` for future use.

In [10]:
vocabulary = sorted(list(set(''.join(model_input)))+['\n'])
vocab_size = len(vocabulary)
print("There are", str(vocab_size),"unique characters in the", len(model_input),"reviews which are: \n", vocabulary)

There are 38 unique characters in the 34658 reviews which are: 
 ['\n', ' ', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


Create a dictionary `char_to_idx` to get the index in the vocabulary for a character in a review. We also build a dictionary `idx_to_char` to get the character from the index in the vocabulary.

In [11]:
char_to_idx = { ch:i for i,ch in enumerate(vocabulary) }
idx_to_char = { i:ch for i,ch in enumerate(vocabulary) }
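
The two dictionaries are inverses of each other, which a short round trip makes visible. The small vocabulary below is a hypothetical stand-in for the notebook's 38-character vocabulary.

```python
# Hypothetical small vocabulary, built the same way as the notebook's
toy_vocabulary = sorted(set("fire tablet") | {"\n"})
toy_char_to_idx = {ch: i for i, ch in enumerate(toy_vocabulary)}
toy_idx_to_char = {i: ch for i, ch in enumerate(toy_vocabulary)}

text = "tablet"
encoded = [toy_char_to_idx[ch] for ch in text]            # characters -> indices
decoded = "".join(toy_idx_to_char[i] for i in encoded)    # indices -> characters

print(encoded)
print(decoded == text)   # True
```

A character that is missing from the vocabulary raises a `KeyError` during encoding, so any new text should be cleaned the same way as the training reviews first.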

In [12]:
char_to_idx

{'\n': 0,
 ' ': 1,
 '0': 2,
 '1': 3,
 '2': 4,
 '3': 5,
 '4': 6,
 '5': 7,
 '6': 8,
 '7': 9,
 '8': 10,
 '9': 11,
 'a': 12,
 'b': 13,
 'c': 14,
 'd': 15,
 'e': 16,
 'f': 17,
 'g': 18,
 'h': 19,
 'i': 20,
 'j': 21,
 'k': 22,
 'l': 23,
 'm': 24,
 'n': 25,
 'o': 26,
 'p': 27,
 'q': 28,
 'r': 29,
 's': 30,
 't': 31,
 'u': 32,
 'v': 33,
 'w': 34,
 'x': 35,
 'y': 36,
 'z': 37}

In [13]:
idx_to_char

{0: '\n',
 1: ' ',
 2: '0',
 3: '1',
 4: '2',
 5: '3',
 6: '4',
 7: '5',
 8: '6',
 9: '7',
 10: '8',
 11: '9',
 12: 'a',
 13: 'b',
 14: 'c',
 15: 'd',
 16: 'e',
 17: 'f',
 18: 'g',
 19: 'h',
 20: 'i',
 21: 'j',
 22: 'k',
 23: 'l',
 24: 'm',
 25: 'n',
 26: 'o',
 27: 'p',
 28: 'q',
 29: 'r',
 30: 's',
 31: 't',
 32: 'u',
 33: 'v',
 34: 'w',
 35: 'x',
 36: 'y',
 37: 'z'}

## E - Making Predictions using the Parameters

#### VARIABLES:

- **review_start**: The text of the review for which the future characters are to be predicted
- **chars_to_predict**: The number of characters to predict using the model
- **params**: A dictionary containing the weights and biases of the model for prediction
- **char_to_idx**: A dictionary of our vocabulary to get indices for characters within the vocabulary
- **idx_to_char**: A dictionary of our vocabulary to get characters within the vocabulary from their indices

In [14]:
def predict(review_start, chars_to_predict, params, char_to_idx, idx_to_char):
    """
    Predict a sequence of characters to follow an input according to the probability distributions output of an RNN
    """
    # Clean up sample review for prediction
    review_start = clean_text(review_start).lower()
    
    # Retrieve weights, biases and relevant shapes from "params" dictionary
    Weight_ax, Weight_aa, Weight_ya, bias_a, bias_y = params["Weight_ax"], params["Weight_aa"], params["Weight_ya"], params["bias_a"], params["bias_y"]
    vocab_size = bias_y.shape[0]
    
    # Get hidden layer size from the weight_aa
    len_a = Weight_aa.shape[1]
    
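    # Build the initial input vector x. Every character of the review
    # start sets its own entry to 1 in the loop below, so x can have
    # several entries set to 1 (multi-hot) rather than being one-hot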
    x = np.zeros((vocab_size,1))
    # break the review into list of characters
    chars_in_review_start = [char for char in review_start]
    # create empty list which will hold the indices of the generated review
    review_indices = []
    # loop through the characters in review and add them to input vector x
    # and add their indices from vocabulary to the generated review output list
    for i in range(len(chars_in_review_start)):
        index = char_to_idx[chars_in_review_start[i]]
        x[index] = 1
        review_indices.append(index)
    
    # Initialize a_prev as zeros
    a_prev = np.zeros((len_a, 1))
    
    # predict next character
    for t in range(chars_to_predict):
        
        # Forward propagate x to predict next character 
        a = np.tanh(np.dot(Weight_ax, x) + np.dot(Weight_aa, a_prev) + bias_a)
        z = np.dot(Weight_ya, a) + bias_y
        y = softmax(z)
        
        # Sample the index of a character from the vocabulary 
        # based on the probability distribution y obtained from softmax
        index_pred = np.random.choice(range(vocab_size), p= np.ravel(y))
        
        # Update the input x with one that corresponds to the sampled 
        # index `index_pred`. This is used as input to the next prediction
        x = np.zeros((vocab_size,1))
        x[index_pred] = 1
        
        # Add the predicted character's index to the review start
        review_indices.append(index_pred)
        
        # Update "a_prev" to be "a"
        a_prev = a
    
    predicted_review = ''.join(idx_to_char[i] for i in review_indices)
    return predicted_review
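
The `predict` function above marks every character of the review start in a single input vector before sampling. A common alternative is to "prime" the hidden state by feeding the prompt one character per time-step, so that the order of the characters is preserved. The block below is only a minimal sketch of that idea, not the method used to produce the results in this notebook. The function `predict_primed`, the local `softmax`, the `seed` argument and all `toy_*` names are assumptions introduced for illustration; only the parameter keys (`Weight_ax`, `Weight_aa`, `Weight_ya`, `bias_a`, `bias_y`) come from the notebook.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a column vector (local helper for this sketch)."""
    e = np.exp(z - np.max(z))
    return e / np.sum(e, axis=0)

def predict_primed(review_start, chars_to_predict, params, char_to_idx, idx_to_char, seed=None):
    """Hypothetical variant of predict(): feed the prompt one character per time-step."""
    rng = np.random.default_rng(seed)
    Weight_ax, Weight_aa, Weight_ya = params["Weight_ax"], params["Weight_aa"], params["Weight_ya"]
    bias_a, bias_y = params["bias_a"], params["bias_y"]
    vocab_size, len_a = bias_y.shape[0], Weight_aa.shape[1]

    review_indices = [char_to_idx[char] for char in review_start]
    a_prev = np.zeros((len_a, 1))
    # Uniform distribution, only used if the prompt is empty
    y = np.full((vocab_size, 1), 1.0 / vocab_size)

    # Priming: run the forward step on each prompt character, in order
    for index in review_indices:
        x = np.zeros((vocab_size, 1))
        x[index] = 1
        a_prev = np.tanh(np.dot(Weight_ax, x) + np.dot(Weight_aa, a_prev) + bias_a)
        y = softmax(np.dot(Weight_ya, a_prev) + bias_y)

    # Generation: sample a character, then feed it back in as the next input
    for _ in range(chars_to_predict):
        index_pred = int(rng.choice(vocab_size, p=np.ravel(y)))
        review_indices.append(index_pred)
        x = np.zeros((vocab_size, 1))
        x[index_pred] = 1
        a_prev = np.tanh(np.dot(Weight_ax, x) + np.dot(Weight_aa, a_prev) + bias_a)
        y = softmax(np.dot(Weight_ya, a_prev) + bias_y)

    return ''.join(idx_to_char[i] for i in review_indices)

# Toy setup with random (untrained) weights, just to show the call
toy_vocab = sorted(set("abc \n"))
toy_char_to_idx = {char: i for i, char in enumerate(toy_vocab)}
toy_idx_to_char = {i: char for char, i in toy_char_to_idx.items()}
toy_rng = np.random.default_rng(0)
toy_len_a, toy_vocab_size = 8, len(toy_vocab)
toy_params = {
    "Weight_ax": toy_rng.normal(size=(toy_len_a, toy_vocab_size)) * 0.1,
    "Weight_aa": toy_rng.normal(size=(toy_len_a, toy_len_a)) * 0.1,
    "Weight_ya": toy_rng.normal(size=(toy_vocab_size, toy_len_a)) * 0.1,
    "bias_a": np.zeros((toy_len_a, 1)),
    "bias_y": np.zeros((toy_vocab_size, 1)),
}
toy_sample = predict_primed("ab", 5, toy_params, toy_char_to_idx, toy_idx_to_char, seed=1)
print(repr(toy_sample))
```

With untrained weights the generated characters are close to uniformly random. The point of the sketch is only the control flow: the hidden state already summarizes the prompt when sampling starts.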

## F - Training the Model

* In the dataset of customer reviews, we use each review as one training example.
* Every 2000 steps of stochastic gradient descent, we will predict a sample review to see how the algorithm is doing. 
* We will shuffle the dataset, so that stochastic gradient descent visits the examples in random order. 

#### VARIABLES:

- **`model_input`**: The list of cleaned reviews used for training. Each review is one training example, and the list is shuffled in place before training
- **`idx_to_char`**: A dictionary of our vocabulary to get characters within the vocabulary from their indices
- **`char_to_idx`**: A dictionary of our vocabulary to get indices for characters within the vocabulary
- **`vocab_size`**: Size of the vocabulary the model was trained on
- **`learning_rate`**: The step size in Stochastic Gradient Descent $default=0.01$
- **`num_iterations`**: The number of training steps to run, where each step trains on one review from our dataset. $default=100000$<br>
NOTE-- epochs = num_iterations / num_reviews_in_training_set
- **`len_a`**: number of units of the RNN cell $default=100$
- **`sample_review`**: The start of the sample review for which the next `sample_length` characters are predicted to sample test the learning progress. **default="We really had the most amazing"**
- **`sample_length`**: The number of characters to predict in the sample review using the model. $default=30$
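
To make the NOTE on `num_iterations` concrete, here is a small worked example of the epoch arithmetic. The two numbers are the ones reported for the training run in this notebook (200,000 iterations over 34,658 reviews); the variable names are chosen only for this illustration.

```python
num_iterations = 200_000   # iterations used for the training run in this notebook
num_reviews = 34_658       # number of reviews in the training set

# Each iteration trains on exactly one review, so one epoch = num_reviews iterations
epochs = num_iterations / num_reviews
print(f"{epochs:.2f} epochs")   # prints "5.77 epochs"

# The reverse question: how many iterations are needed for a target number of epochs?
target_epochs = 10
iterations_needed = target_epochs * num_reviews
print(iterations_needed)        # prints 346580
```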

In [103]:
def train_model(model_input, idx_to_char, char_to_idx, vocab_size, learning_rate = 0.01, num_iterations = 100000, len_a = 100, sample_review = "We really had the most amazing", sample_length = 30):
    """
    Trains the model which generates reviews. 
    
    OUTPUT:
    params -- learned weights and biases for our RNN model
    """
    
    # Retrieve len_x and len_y from vocab_size
    len_x, len_y = vocab_size, vocab_size
    
    # Initialize parameters
    params = create_params(len_a, len_x, len_y)
    
    # Initialize the smoothed loss to the loss of a uniform guess
    # over the vocabulary for a single character
    smooth_loss = -np.log(1.0/vocab_size)
    
    # Shuffle list of all reviews
    np.random.seed(0)
    np.random.shuffle(model_input)
    
    # Initialize the hidden state of our RNN
    a_prev = np.zeros((len_a, 1))
    
    # Optimization loop
    for j in range(num_iterations):
        # Pick the next review, wrapping around once every review has been seen
        idx = j % len(model_input)
        
        # Set the input X: the review's character indices,
        # preceded by None as a placeholder for the first time-step
        review = model_input[idx]
        review_chars = list(review)
        review_ix = [char_to_idx[char] for char in review_chars]
        X = [None]+review_ix
        
        # Set the labels Y: X shifted one step to the left, ending with a newline
        ix_new_line = char_to_idx['\n']
        Y_actual = X[1:]+[ix_new_line]
        
        # Perform one optimization step: Forward-prop -> Backward-prop -> Clip -> Loss -> Update parameters
        # using the learning rate passed to train_model (0.01 by default)
        loss, grads, a_prev = optimization(X, Y_actual, a_prev, params, vocab_size, learning_rate)
        
        # Keep an exponential moving average of the loss, so that the
        # reported loss is less noisy from one review to the next
        smooth_loss = smooth_loss*0.999 + loss*0.001
        
        # Print graphic to know progress
        print("Time to next sample: "+str(j%2000)+"/2000", end="\r")
        
        # Every 2000 iterations, generate "sample_length" characters after the "sample_review"
        # using the predict() function to check if the model is learning properly
        if j % 2000 == 0:
            print('\n'+'Iteration: %d, Loss: %f' % (j, smooth_loss) + '\n')
            print("Review Start: "+sample_review+"\n")
            print("Prediction:", predict(sample_review, sample_length, params, char_to_idx, idx_to_char))
            print('\n')
    
    return params
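
To illustrate the two bookkeeping steps inside the training loop, the sketch below builds the input/label pair for one short review and then applies the same loss smoothing formula. The toy vocabulary, the review "good" and the `fake_losses` values are made up for this example; they are not taken from the dataset or from a real run.

```python
import numpy as np

# Hypothetical toy vocabulary, small enough to check by hand
toy_vocab = ['\n', 'a', 'd', 'g', 'o']
toy_char_to_idx = {char: i for i, char in enumerate(toy_vocab)}

# Inputs and labels for one review: Y is X shifted left by one step,
# with a newline appended to mark the end of the review
review = "good"
review_ix = [toy_char_to_idx[char] for char in review]
X = [None] + review_ix
Y_actual = X[1:] + [toy_char_to_idx['\n']]
print(X)          # [None, 3, 4, 4, 2]
print(Y_actual)   # [3, 4, 4, 2, 0]

# Loss smoothing: start from the loss of a uniform guess, then move
# 0.1% of the way towards each new loss value
smooth_loss = -np.log(1.0 / len(toy_vocab))
initial_smooth_loss = smooth_loss
fake_losses = [2.0, 2.0, 2.0]   # made-up per-step losses
for loss in fake_losses:
    smooth_loss = smooth_loss * 0.999 + loss * 0.001
print(initial_smooth_loss, smooth_loss)
```

At each time-step the label is simply the next character of the review, which is what lets the trained network be used for generation. Because only 0.1% of each new loss enters the average, the smoothed value changes slowly and reacts to trends rather than to single reviews.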

In [106]:
params = train_model(model_input, idx_to_char, char_to_idx, vocab_size, learning_rate = 0.01, num_iterations = 200000, len_a = 40)

Review Start: We really had the most amazing

Prediction: we really had the most amazingu7q1r1a9nrdqq lsjhh t63kr81362

Iteration: 0, Loss: 3.834317



Review Start: We really had the most amazing

Prediction: we really had the most amazing owsed live louthe apriold res

Iteration: 2000, Loss: 276.751667



Review Start: We really had the most amazing

Prediction: we really had the most amazingate erich a fo theat not a blw

Iteration: 4000, Loss: 285.201178



Review Start: We really had the most amazing

Prediction: we really had the most amazingo i a booderd
ouspsint on inst

Iteration: 6000, Loss: 275.684595



Review Start: We really had the most amazing

Prediction: we really had the most amazinge it enjought almedi daystion 

Iteration: 8000, Loss: 288.216563



Review Start: We really had the most amazing

Prediction: we really had the most amazinge or then this curnyostey and 

Iteration: 10000, Loss: 289.038069



Review Start: We really had the most amazing

Prediction: we r

Review Start: We really had the most amazing

Prediction: we really had the most amazingass ther eapcwbought valrnesma

Iteration: 104000, Loss: 282.200366



Review Start: We really had the most amazing

Prediction: we really had the most amazingot  not for resoly theppery wr

Iteration: 106000, Loss: 273.105373



Review Start: We really had the most amazing

Prediction: we really had the most amazingever coweticinted phichlaturou

Iteration: 108000, Loss: 261.756817



Review Start: We really had the most amazing

Prediction: we really had the most amazingek hech ivimer an and a size t

Iteration: 110000, Loss: 258.914999



Review Start: We really had the most amazing

Prediction: we really had the most amazinges as to remern greating houn 

Iteration: 112000, Loss: 273.561494



Review Start: We really had the most amazing

Prediction: we really had the most amazingablend
its das for amazons eve

Iteration: 114000, Loss: 278.577147



Review Start: We really had the most amazing



In [107]:
#np.save('params40.npy', params)

# Generations

In [2]:
def load_params():
    # np.save stores the dictionary inside a 0-d object array; .item() unwraps it
    return np.load('params40.npy', allow_pickle=True).item()

params = load_params()
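
For readers who want to check how a parameter dictionary survives a save and load cycle, here is a minimal round-trip sketch. It writes to a temporary directory rather than to `params40.npy`, and `demo_params` is a made-up stand-in for the trained weights. `np.save` wraps the dictionary in a 0-d object array and pickles it, which is why `allow_pickle=True` is needed when loading.

```python
import os
import tempfile
import numpy as np

# Made-up stand-in for the trained parameter dictionary
demo_params = {
    "Weight_aa": np.ones((2, 2)),
    "bias_a": np.zeros((2, 1)),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "params_demo.npy")
    # The dict is stored as a pickled 0-d object array
    np.save(path, demo_params)
    # .item() extracts the original dict from the 0-d array
    loaded_params = np.load(path, allow_pickle=True).item()

print(type(loaded_params).__name__)   # prints "dict"
print(sorted(loaded_params.keys()))   # prints ['Weight_aa', 'bias_a']
```

Since loading relies on pickle, a file like this should only be loaded when it comes from a trusted source.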

In [27]:
your_input = "Data Science is an amazing field because "

Run the cell below to generate the next 50 characters predicted for `your_input`

In [29]:
print("Prediction:", predict(your_input, 50, params, char_to_idx, idx_to_char))

Prediction: data science is an amazing field because e amazon and it daus it
grild is more fall celint 


# Conclusion

The pre-trained model weights that I saved were trained with 40 RNN units for 200,000 iterations, which is ~6 epochs over the 34,658 reviews. This alone took a total of 18 hours of training. Therefore, due to system limitations, I am unable to train the model further to get better results. However, seeing that the generated reviews no longer contain random numbers and the words have slowly started to make sense individually, I believe that the model should perform better if we are able to train it for an appropriate number of epochs.

# REFERENCES

1. SOFTMAX SOURCE: https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/

2. Sequence Models - Deep Learning Specialization (Coursera): https://www.coursera.org/learn/nlp-sequence-models/home/welcome

This has been the backbone of the mathematics for my project and the source of the images.