<a href="https://colab.research.google.com/github/KhotNoorin/Deep-Learning/blob/main/Recurrent_Neural_Network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Recurrent Neural Network (RNN) in Deep Learning


Recurrent Neural Networks (RNNs) are a class of neural networks that are well-suited for processing sequential data such as time series, natural language, audio, and video. Unlike traditional feedforward neural networks, RNNs have connections that form directed cycles, allowing them to maintain a hidden state and capture temporal dependencies.


- **Sequential Processing**: RNNs process inputs one step at a time, maintaining a memory (hidden state) of previous steps.
- **Hidden State**: The hidden state acts as the memory of the network, enabling it to carry information across timesteps.
- **Weight Sharing**: The same weights are applied at each timestep, which keeps the parameter count independent of sequence length and lets the network handle sequences of arbitrary length.
- **Backpropagation Through Time (BPTT)**: The training algorithm used for RNNs is a modified version of backpropagation that accounts for the sequential nature of data.

## RNN Architecture

At each timestep t, the RNN performs the following computations:

- Hidden state update:
  \[
  h_t = \tanh(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b_h)
  \]

- Output (optional):
  \[
  y_t = W_{hy} \cdot h_t + b_y
  \]

Where:
- \( x_t \): input at time t
- \( h_t \): hidden state at time t
- \( W_{xh}, W_{hh}, W_{hy} \): weight matrices
- \( b_h, b_y \): bias terms
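
As a minimal sketch of the hidden state update above, the snippet below applies one shared set of weights to sequences of different lengths. The 1-D weights are hand-picked for illustration (an assumption, not learned values), and `rnn_step` and `run` are hypothetical helper names.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One application of h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Hand-picked 1-D parameters (illustrative only, not learned)
W_xh = np.array([[1.0]])
W_hh = np.array([[0.5]])
b_h = np.zeros(1)

def run(sequence):
    """Apply the same weights at every timestep, starting from h_0 = 0."""
    h = np.zeros(1)
    for x_t in sequence:
        h = rnn_step(np.array([x_t]), h, W_xh, W_hh, b_h)
    return h

h_short = run([1.0])            # one step: tanh(1.0)
h_long = run([1.0, 0.0, 0.0])   # same weights, longer sequence
print(h_short, h_long)
```

The longer sequence needs no extra parameters; the first input simply decays through two further applications of `W_hh` and `tanh`.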

## Applications

- Natural Language Processing (NLP): text generation, sentiment analysis, translation
- Time Series Forecasting: stock price prediction, weather forecasting
- Speech Recognition
- Music Generation

## Disadvantages

- **Vanishing/Exploding Gradients**: During BPTT, gradients can shrink or grow exponentially, making training unstable for long sequences.
- **Short-Term Memory**: Standard RNNs struggle to capture long-range dependencies.
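
The exponential shrinking or growth can be illustrated with a simplified sketch. It assumes the recurrent matrix is a scaled identity and ignores the `tanh` derivative (which is at most 1 and would only shrink the gradient further); `backprop_norms` is a hypothetical helper.

```python
import numpy as np

def backprop_norms(scale, steps, size=4):
    """Norm of a gradient vector after each of `steps` multiplications by W^T."""
    W = scale * np.eye(size)   # assumption: scaled identity, so the effect is exact
    grad = np.ones(size)
    norms = []
    for _ in range(steps):
        grad = W.T @ grad      # one step of backpropagation through time
        norms.append(np.linalg.norm(grad))
    return norms

vanishing = backprop_norms(scale=0.5, steps=20)
exploding = backprop_norms(scale=1.5, steps=20)
print(f"scale 0.5 after 20 steps: {vanishing[-1]:.2e}")
print(f"scale 1.5 after 20 steps: {exploding[-1]:.2e}")
```

After 20 steps the norm has been multiplied by 0.5^20 (roughly one millionth) in the first case and by 1.5^20 (roughly 3300) in the second.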

## Solutions to RNN Limitations

To address the limitations of standard RNNs, advanced architectures were developed:
- **Long Short-Term Memory (LSTM)**
- **Gated Recurrent Unit (GRU)**

These architectures introduce gates to better control the flow of information and retain memory over longer sequences.
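
To make the idea of a gate concrete, here is a rough NumPy sketch of one GRU step. The parameter names in `p` are hypothetical, and conventions differ between papers and libraries on which term the update gate `z` multiplies, so treat this as one common formulation rather than a reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step; `p` is a dict of weights (hypothetical names)."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])   # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])   # reset gate
    h_cand = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])
    return (1.0 - z) * h_prev + z * h_cand   # blend old state and candidate

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
p = {}
for g in "zrh":
    p["W" + g] = rng.normal(size=(n_hid, n_in))
    p["U" + g] = rng.normal(size=(n_hid, n_hid))
    p["b" + g] = np.zeros(n_hid)

# Force the update gate to stay almost closed (z close to 0)
p["Wz"] = np.zeros((n_hid, n_in))
p["Uz"] = np.zeros((n_hid, n_hid))
p["bz"] = np.full(n_hid, -20.0)

h_prev = rng.normal(size=n_hid)
h_new = gru_step(rng.normal(size=n_in), h_prev, p)
print(np.max(np.abs(h_new - h_prev)))
```

With the update gate closed, the previous hidden state passes through almost unchanged, which is the mechanism that lets gated units carry information across many timesteps.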

## Summary

RNNs are foundational models for sequence data in deep learning. While they have limitations with long-term dependencies, they serve as a basis for more advanced recurrent models like LSTMs and GRUs, which have significantly improved performance in many applications.


In [1]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense, Embedding
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

In [2]:
num_samples = 1000
sequence_length = 5
vocab_size = 11  # token ids 1..10 are generated; index 0 is never used
num_classes = 2

In [3]:
X = np.random.randint(1, vocab_size, size=(num_samples, sequence_length))
y = np.sum(X, axis=1) % num_classes  # simple rule to create binary labels
y = to_categorical(y, num_classes=num_classes)

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [5]:
# Model definition
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=8, input_length=sequence_length))
model.add(SimpleRNN(units=32, activation='tanh'))
model.add(Dense(num_classes, activation='softmax'))



In [6]:
# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [7]:
# Train the model
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))

Epoch 1/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 4s 38ms/step - accuracy: 0.4998 - loss: 0.6944 - val_accuracy: 0.4700 - val_loss: 0.6956
Epoch 2/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.5026 - loss: 0.6918 - val_accuracy: 0.5050 - val_loss: 0.6975
Epoch 3/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.5278 - loss: 0.6883 - val_accuracy: 0.4800 - val_loss: 0.6966
Epoch 4/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.5648 - loss: 0.6853 - val_accuracy: 0.4850 - val_loss: 0.6995
Epoch 5/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.5601 - loss: 0.6817 - val_accuracy: 0.4800 - val_loss: 0.7039
Epoch 6/10
25/25 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.5732 - loss: 0.6772 - val_accuracy: 0.4900 - val_loss: 0.7067
Epoch 7/10
25/25 ━━━━━━━━━

In [8]:
# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Test Accuracy: {accuracy:.4f}')

7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - accuracy: 0.4585 - loss: 0.7132
Test Accuracy: 0.4700


# RNN Forward Propagation

RNN forward propagation involves computing hidden states and outputs over a sequence using shared weights. This mechanism gives RNNs the ability to model sequential and time-dependent data effectively.

In [9]:
import numpy as np

In [10]:
input_size = 4    # Size of input vector (features)
hidden_size = 5   # Size of hidden state
output_size = 2   # Size of output vector
timesteps = 6     # Length of input sequence

In [11]:
# Create dummy input sequence (batch size = 1 for simplicity)
np.random.seed(0)
x = np.random.randn(timesteps, input_size)

In [12]:
# Initialize weights and biases
Wxh = np.random.randn(hidden_size, input_size)   # Input to hidden
Whh = np.random.randn(hidden_size, hidden_size)  # Hidden to hidden
Why = np.random.randn(output_size, hidden_size)  # Hidden to output

In [13]:
bh = np.zeros((hidden_size, 1))  # Bias for hidden layer
by = np.zeros((output_size, 1))  # Bias for output layer

In [14]:
# Initialize the initial hidden state
h_prev = np.zeros((hidden_size, 1))

In [15]:
# Store all hidden states and outputs
hidden_states = []
outputs = []

In [16]:
# Forward propagation through time
for t in range(timesteps):
    xt = x[t].reshape(-1, 1)  # Shape (input_size, 1)

    # Hidden state calculation
    ht = np.tanh(np.dot(Wxh, xt) + np.dot(Whh, h_prev) + bh)

    # Output calculation
    yt = np.dot(Why, ht) + by

    # Store results
    hidden_states.append(ht)
    outputs.append(yt)

    # Update hidden state for next time step
    h_prev = ht

In [17]:
# Convert outputs to numpy arrays for easier viewing
outputs = np.array(outputs).squeeze()

In [18]:
print("Final output from RNN (one for each timestep):\n", outputs)

Final output from RNN (one for each timestep):
 [[ 1.36210111 -0.04806836]
 [-0.38264111  1.23505342]
 [ 0.47215486 -1.39426624]
 [-0.66213135  2.32535818]
 [ 2.92316993  0.1776373 ]
 [-2.84506174 -0.23812638]]
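
The loop above uses a batch size of 1 with column vectors. A possible batched variant is sketched below; it assumes a row-vector layout of shape `(batch, timesteps, input_size)`, and `rnn_forward` is a hypothetical helper name rather than a library function.

```python
import numpy as np

def rnn_forward(x, Wxh, Whh, Why, bh, by, h0=None):
    """x: (batch, timesteps, input_size). Returns all hidden states and outputs.

    Row-vector convention: h_t = tanh(x_t Wxh^T + h_{t-1} Whh^T + bh).
    """
    batch, timesteps, _ = x.shape
    hidden_size = Whh.shape[0]
    h = np.zeros((batch, hidden_size)) if h0 is None else h0
    hs, ys = [], []
    for t in range(timesteps):
        h = np.tanh(x[:, t, :] @ Wxh.T + h @ Whh.T + bh)
        hs.append(h)
        ys.append(h @ Why.T + by)
    # Stack along a new time axis: (batch, timesteps, features)
    return np.stack(hs, axis=1), np.stack(ys, axis=1)

rng = np.random.default_rng(0)
batch, timesteps, input_size, hidden_size, output_size = 3, 6, 4, 5, 2
x = rng.normal(size=(batch, timesteps, input_size))
Wxh = rng.normal(size=(hidden_size, input_size))
Whh = rng.normal(size=(hidden_size, hidden_size))
Why = rng.normal(size=(output_size, hidden_size))
bh = np.zeros(hidden_size)
by = np.zeros(output_size)

hs, ys = rnn_forward(x, Wxh, Whh, Why, bh, by)
print(hs.shape, ys.shape)
```

Each sequence in the batch is processed independently, so the result for any single sample matches what the column-vector loop would produce for it.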


# RNN Sentiment Analysis


Sentiment analysis is a Natural Language Processing (NLP) task that involves determining the emotional tone behind a body of text. It is commonly used in applications such as product reviews, social media monitoring, and customer feedback analysis.

Recurrent Neural Networks (RNNs) are well-suited for sentiment analysis because they can process and learn from sequential data such as text. In Keras, RNN layers like SimpleRNN, LSTM, or GRU can be used to capture the contextual information from a sequence of words.


## RNN Architecture for Sentiment Analysis

- **Embedding Layer**: Converts word indices to dense vectors of fixed size.
- **SimpleRNN Layer**: Processes the embedded sequence one step at a time and maintains a hidden state.
- **Dense Output Layer**: Produces a binary prediction (positive or negative sentiment).

## Benefits of Using RNN for Sentiment Analysis

- Maintains context from earlier parts of the text due to its hidden state mechanism.
- Learns the structure and patterns of language over sequences.
- Captures word order and context, which bag-of-words representations discard.
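
The point about word order can be illustrated with a small sketch. It compares a bag-of-words count with the final hidden state of a tiny, untrained RNN; the random weights and the two example sentences are assumptions chosen purely for illustration.

```python
from collections import Counter
import numpy as np

a = "good not bad".split()
b = "bad not good".split()

# Bag-of-words keeps only counts, so the two sentences look identical
same_bag = Counter(a) == Counter(b)

# Tiny untrained RNN with random weights (illustrative only)
rng = np.random.default_rng(0)
vocab = {w: i for i, w in enumerate(sorted(set(a)))}
E = rng.normal(size=(len(vocab), 3))   # one 3-d vector per word
W_xh = rng.normal(size=(4, 3))
W_hh = rng.normal(size=(4, 4))

def final_state(words):
    """Run the recurrence over the words and return the last hidden state."""
    h = np.zeros(4)
    for w in words:
        h = np.tanh(W_xh @ E[vocab[w]] + W_hh @ h)
    return h

same_state = np.allclose(final_state(a), final_state(b))
print(same_bag, same_state)
```

The bag comparison is equal by construction, while the hidden states differ because the recurrence depends on the order in which words arrive.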

## Limitations

- Standard RNNs may suffer from vanishing gradients and struggle with long sequences.
- LSTM and GRU layers are often preferred in practice because they retain context over longer spans of text.

## Summary

RNNs in Keras provide a powerful and flexible way to perform sentiment analysis on sequential text data. While simple RNNs can work for short sequences, LSTMs or GRUs are usually recommended for improved performance on longer texts.


In [35]:
docs = [
    'go india',
    'india india',
    'hip hip hurray',
    'jeetega bhai jeetega india jeetega',
    'bharat mata ki jai',
    'kohli kohli',
    'sachin sachin',
    'dhoni dhoni',
    'modi ji ki jai',
    'inquilab zindabad'
]

In [36]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

In [37]:
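# Assumed reconstruction: `tokenizer` and `sequences` are used below but were never defined
tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)
sequences = pad_sequences(tokenizer.texts_to_sequences(docs), padding='post', maxlen=5)

# Build and summarize the model
vocab_size = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=2, input_length=5))
model.summary()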

In [39]:
# Dummy compile and prediction
model.compile(optimizer='adam', loss='mse')  # 'mse' is a placeholder loss; 'accuracy' is a metric, not a loss
pred = model.predict(sequences)
print("Predictions shape:", pred.shape)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 84ms/step
Predictions shape: (10, 5, 2)
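
The `(10, 5, 2)` shape reads as 10 documents, 5 token positions each, and a 2-dimensional embedding per token. The pure-Python sketch below approximates how the integer sequences fed to the embedding layer could be built: words are indexed from 1 by descending frequency and short documents are padded with 0 at the end. It is a rough stand-in under those assumptions, not the Keras implementation, and `encode` is a hypothetical helper.

```python
from collections import Counter

docs = [
    'go india', 'india india', 'hip hip hurray',
    'jeetega bhai jeetega india jeetega', 'bharat mata ki jai',
    'kohli kohli', 'sachin sachin', 'dhoni dhoni',
    'modi ji ki jai', 'inquilab zindabad',
]

# Most frequent word gets index 1; index 0 is kept free for padding
counts = Counter(word for doc in docs for word in doc.split())
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def encode(doc, maxlen=5):
    """Map words to indices and pad at the end; no document here exceeds maxlen."""
    ids = [word_index[w] for w in doc.split()]
    return ids + [0] * (maxlen - len(ids))

sequences = [encode(d) for d in docs]
vocab_size = len(word_index) + 1   # +1 for the padding index
print(vocab_size)
print(sequences[0], sequences[3])
```

An embedding layer then looks up one row of its weight matrix per index, which is why every document contributes a `5 x 2` block to the prediction.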


In [40]:
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

In [41]:
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000)

In [42]:
# Pad sequences
X_train = pad_sequences(X_train, padding='post', maxlen=50)
X_test = pad_sequences(X_test, padding='post', maxlen=50)

In [43]:
# Build RNN model
model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=2, input_length=50))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))

In [44]:
model.summary()
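
The summary output is not shown here, but the parameter counts can be worked out by hand. The sketch below uses the standard formulas for embedding, simple recurrent, and dense layers with biases; the helper names are hypothetical.

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary entry
    return vocab * dim

def simple_rnn_params(input_dim, units):
    # W_xh (units x input_dim) + W_hh (units x units) + bias (units)
    return units * (input_dim + units + 1)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return units * (input_dim + 1)

emb = embedding_params(10000, 2)
rnn = simple_rnn_params(2, 32)
out = dense_params(32, 1)
print(emb, rnn, out, emb + rnn + out)
```

None of these counts depends on the sequence length of 50, because the recurrent weights are reused at every timestep.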

In [45]:
# Compile and train
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=5, validation_data=(X_test, y_test))

Epoch 1/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 17s 19ms/step - accuracy: 0.5208 - loss: 0.6879 - val_accuracy: 0.7700 - val_loss: 0.4825
Epoch 2/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 20s 18ms/step - accuracy: 0.8018 - loss: 0.4366 - val_accuracy: 0.8157 - val_loss: 0.4137
Epoch 3/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 22s 20ms/step - accuracy: 0.8577 - loss: 0.3368 - val_accuracy: 0.7900 - val_loss: 0.4755
Epoch 4/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 14s 18ms/step - accuracy: 0.8892 - loss: 0.2807 - val_accuracy: 0.8053 - val_loss: 0.4723
Epoch 5/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 21s 18ms/step - accuracy: 0.9035 - loss: 0.2474 - val_accuracy: 0.8016 - val_loss: 0.4688


# Backpropagation in RNNs

Backpropagation in RNNs is an extension of the backpropagation algorithm used in feedforward neural networks. It is called **Backpropagation Through Time (BPTT)** because RNNs handle sequential data where the output at each time step depends not only on the current input but also on previous inputs.

## Why BPTT Is Needed

Unrolling an RNN over a sequence applies the same weights at every time step, so each weight influences the loss through every step. Training therefore propagates the error backward through the time steps in reverse order and sums each step's contribution to the gradient; this process is known as **Backpropagation Through Time**.

## Steps

1. **Forward Pass**:
   - For each time step \( t \), compute the hidden state \( h_t \) and the output \( y_t \) using:
     \[
     h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)
     \]
     \[
     y_t = W_{hy} h_t + b_y
     \]

2. **Loss Calculation**:
   - Compute the total loss over all time steps:
     \[
     L = \sum_{t=1}^{T} \mathcal{L}(y_t, \hat{y}_t)
     \]

3. **Backward Pass (BPTT)**:
   - Compute the gradients of the loss with respect to weights by applying the chain rule backward through time.
   - Since the same weights are used at every time step, the gradients accumulate over time, where \( L_t = \mathcal{L}(y_t, \hat{y}_t) \) is the loss at step \( t \):
     \[
     \frac{\partial L}{\partial W} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial W}
     \]
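
As a sketch of what "applying the chain rule backward through time" means for the recurrent weights, each per-step term can be expanded as follows. Here \( L_t = \mathcal{L}(y_t, \hat{y}_t) \), and \( \partial^{+} \) denotes the immediate derivative that treats \( h_{k-1} \) as a constant (notation introduced here for illustration):

\[
\frac{\partial L_t}{\partial W_{hh}} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t} \left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right) \frac{\partial^{+} h_k}{\partial W_{hh}},
\qquad
\frac{\partial h_j}{\partial h_{j-1}} = \operatorname{diag}\!\left(1 - h_j^{2}\right) W_{hh}
\]

The product of Jacobians grows longer as \( t - k \) increases, which is where the repeated multiplication by \( W_{hh} \) and by the \( \tanh \) derivative enters.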

## Challenges in BPTT

- **Vanishing Gradient**: Gradients shrink as they are propagated backward, leading to poor learning in long sequences.
- **Exploding Gradient**: Gradients grow exponentially, causing unstable training.

## Solutions

- **Gradient Clipping**: Caps the gradients during backpropagation to avoid explosion.
- **Use of LSTM/GRU**: Gated architectures help in retaining long-term dependencies better than vanilla RNNs.
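
One common form of gradient clipping rescales all gradients together when their joint norm exceeds a threshold. The sketch below is a minimal NumPy version under that assumption; `clip_by_global_norm` is a hypothetical helper, and element-wise clipping with `np.clip` is a simpler alternative.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))   # never scale up
    return [g * scale for g in grads], total

grads = [np.array([[3.0, 0.0]]), np.array([4.0])]   # joint norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Rescaling by a single factor preserves the direction of the overall update, whereas element-wise clipping can change it.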

## Summary

Backpropagation in RNNs involves computing gradients across time steps due to the sequential nature of the data. BPTT is the adapted algorithm that handles temporal dependencies and trains the shared weights across time effectively.


In [1]:
import numpy as np

In [2]:
# Set random seed for reproducibility
np.random.seed(42)

In [3]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [4]:
def dsigmoid(x):
    return sigmoid(x) * (1 - sigmoid(x))


In [5]:
def tanh(x):
    return np.tanh(x)

In [6]:
def dtanh(x):
    return 1 - np.tanh(x) ** 2

In [7]:
# Sample dataset: input sequence of 3 time steps, each with 2 features
X = [np.array([[1], [0]]), np.array([[0], [1]]), np.array([[1], [1]])]
Y = [np.array([[1]]), np.array([[0]]), np.array([[1]])]

In [8]:
# RNN parameters
input_size = 2
hidden_size = 4
output_size = 1
learning_rate = 0.1

In [9]:
# Weight initialization
Wxh = np.random.randn(hidden_size, input_size) * 0.01  # input to hidden
Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden to hidden
Why = np.random.randn(output_size, hidden_size) * 0.01  # hidden to output
bh = np.zeros((hidden_size, 1))  # hidden bias
by = np.zeros((output_size, 1))  # output bias

In [10]:
# Training for 1 epoch (can loop over multiple epochs)
h_prev = np.zeros((hidden_size, 1))
hs, ys, ps = {}, {}, {}
hs[-1] = h_prev
loss = 0

In [12]:
# Forward Pass
for t in range(len(X)):
    x_t = X[t]
    hs[t] = tanh(np.dot(Wxh, x_t) + np.dot(Whh, hs[t-1]) + bh)
    ys[t] = np.dot(Why, hs[t]) + by
    ps[t] = sigmoid(ys[t])
    loss += 0.5 * (ps[t] - Y[t]) ** 2  # half squared error, accumulated over time steps

In [13]:
print("Initial Loss:", np.sum(loss))

Initial Loss: 0.7499675916291635

Note: with weights this small, every prediction is close to 0.5, so a single forward pass gives a loss near 3 × 0.5 × 0.25 = 0.375. The value shown is consistent with the forward-pass cell having been run twice without resetting `loss` (the cell counter skips from 10 to 12).


In [15]:
# Backward Pass (BPTT)
dWxh = np.zeros_like(Wxh)
dWhh = np.zeros_like(Whh)
dWhy = np.zeros_like(Why)
dbh = np.zeros_like(bh)
dby = np.zeros_like(by)
dh_next = np.zeros_like(hs[0])

In [16]:
for t in reversed(range(len(X))):
    dy = (ps[t] - Y[t]) * dsigmoid(ys[t])  # output error
    dWhy += np.dot(dy, hs[t].T)
    dby += dy

    dh = np.dot(Why.T, dy) + dh_next  # backprop into h
    dh_raw = dh * (1 - hs[t] ** 2)  # hs[t] is already tanh(...), so tanh' = 1 - hs[t]**2
    dbh += dh_raw
    dWxh += np.dot(dh_raw, X[t].T)
    dWhh += np.dot(dh_raw, hs[t-1].T)
    dh_next = np.dot(Whh.T, dh_raw)

In [17]:
# Clip gradients to prevent exploding gradients
for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -1, 1, out=dparam)

In [18]:
# Update weights
Wxh -= learning_rate * dWxh
Whh -= learning_rate * dWhh
Why -= learning_rate * dWhy
bh -= learning_rate * dbh
by -= learning_rate * dby

In [19]:
print("Updated weights and biases after 1 BPTT step.")

Updated weights and biases after 1 BPTT step.
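
A common way to sanity-check a hand-written backward pass is to compare it against finite differences. The sketch below is self-contained: it re-implements the forward and backward passes for the same toy data and compares the analytic gradients with central differences. The helper names, the random initialization, and the step size `eps` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]]), np.array([[1.0], [1.0]])]
Y = [np.array([[1.0]]), np.array([[0.0]]), np.array([[1.0]])]

def forward(params):
    Wxh, Whh, Why, bh, by = params
    hs, ps = {-1: np.zeros((Whh.shape[0], 1))}, {}
    loss = 0.0
    for t in range(len(X)):
        hs[t] = np.tanh(Wxh @ X[t] + Whh @ hs[t - 1] + bh)
        ps[t] = sigmoid(Why @ hs[t] + by)
        loss += 0.5 * np.sum((ps[t] - Y[t]) ** 2)
    return loss, hs, ps

def bptt(params):
    """Analytic gradients in the order Wxh, Whh, Why, bh, by."""
    Wxh, Whh, Why, bh, by = params
    _, hs, ps = forward(params)
    grads = [np.zeros_like(p) for p in params]
    dh_next = np.zeros_like(hs[0])
    for t in reversed(range(len(X))):
        dy = (ps[t] - Y[t]) * ps[t] * (1 - ps[t])   # through the sigmoid
        grads[2] += dy @ hs[t].T
        grads[4] += dy
        dh = Why.T @ dy + dh_next
        dh_raw = dh * (1 - hs[t] ** 2)              # tanh' in terms of the activation
        grads[3] += dh_raw
        grads[0] += dh_raw @ X[t].T
        grads[1] += dh_raw @ hs[t - 1].T
        dh_next = Whh.T @ dh_raw
    return grads

def numerical_grad(params, eps=1e-5):
    """Central finite differences, perturbing one entry at a time."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            plus = forward(params)[0]
            p[idx] = old - eps
            minus = forward(params)[0]
            p[idx] = old
            g[idx] = (plus - minus) / (2 * eps)
        grads.append(g)
    return grads

rng = np.random.default_rng(42)
shapes = [(4, 2), (4, 4), (1, 4), (4, 1), (1, 1)]
params = [rng.normal(size=s) * 0.5 for s in shapes]
analytic = bptt(params)
numeric = numerical_grad(params)
max_err = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))
print(f"max abs difference: {max_err:.2e}")
```

A difference many orders of magnitude below the gradient values suggests the backward pass matches the forward pass. Larger weights than `0.01` are used here so that a mistake in the `tanh` derivative would be visible.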
