## Task - 1: Understanding Sentiment Analysis and RNNs

**Q1. What is sentiment analysis and what are its applications?**

Sentiment analysis is the process of having a computer determine the emotional tone (negative, neutral, or positive) of a piece of digital text, such as a user's message.

Applications of sentiment analysis:
- **Social Media Monitoring:**
    - Track public opinion on brands, products, or events.
    - Identify viral trends or shifts in customer sentiment in real time.

- **Customer Feedback Analysis:**
    - Analyze reviews to help understand satisfaction levels.
    - Prioritize complaints or negative experiences for faster resolution.

**Q2. How do RNNs differ from traditional feedforward neural networks?**

A traditional feedforward network treats every input independently: information flows in one direction from input to output, and nothing is retained between examples. A Recurrent Neural Network (RNN) is designed for sequential or time-series data, where the output at a given time step depends on previous inputs.

RNNs remember previous inputs through an internal memory (the hidden state), which helps them learn patterns over time. For example, a feedforward network can recognize digits in images, while RNNs are used for tasks like predicting the next word in a sentence, speech recognition, or stock price forecasting, where the order and context of the data matter.
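To make the contrast concrete, here is a minimal NumPy sketch, not a trained model. It compares an order-blind baseline (sum the inputs, then apply one dense layer) with a simple recurrent loop. The weights, sizes, and function names are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 3-dimensional inputs, 4-dimensional hidden state.
W_xh = rng.normal(size=(4, 3))
W_hh = rng.normal(size=(4, 4)) * 0.5
b = np.zeros(4)

def feedforward_on_sum(xs):
    """Order-blind baseline: sum the inputs, then apply one dense layer."""
    return np.tanh(W_xh @ np.sum(xs, axis=0) + b)

def rnn_final_state(xs):
    """Order-aware: feed inputs one at a time, carrying a hidden state."""
    h = np.zeros(4)
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h + b)
    return h

sequence = rng.normal(size=(5, 3))    # 5 time steps
reversed_sequence = sequence[::-1]    # same inputs, opposite order

same_ff = np.allclose(feedforward_on_sum(sequence),
                      feedforward_on_sum(reversed_sequence))
same_rnn = np.allclose(rnn_final_state(sequence),
                       rnn_final_state(reversed_sequence))
print(f"Feedforward-on-sum output unchanged by reordering: {same_ff}")
print(f"RNN final state unchanged by reordering: {same_rnn}")
```

Reordering the inputs leaves the summed baseline unchanged but changes the RNN's final state, which is the sense in which order and context matter to an RNN.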

**Q3. The concept of hidden states and how information is passed through time steps in RNNs.**

***Concept of Hidden States in RNNs***

In RNNs, a hidden state is a memory-like vector that stores information from previous time steps. It's the key component that allows RNNs to handle sequential data like sentences, time series, or speech.

***How Information is Passed Through Time Steps***
1. At **Time Step t = 1**:
    - Input: $x_1$
    - Previous hidden state: $h_0$ (usually initialized as zeros)
    - RNN computes:

    $$
    h_1 = \tanh(W_{xh} \cdot x_1 + W_{hh} \cdot h_0 + b)
    $$

    - $h_1$ now stores the information from $x_1$

2. At **Time Step t = 2**:
    - Input: $x_2$
    - Previous hidden state: $h_1$
    - RNN computes:

    $$
    h_2 = \tanh(W_{xh} \cdot x_2 + W_{hh} \cdot h_1 + b)
    $$

    - Now $h_2$ combines the new input $x_2$ with the prior context carried in $h_1$

3. This continues for all time steps:

$$
h_t = \tanh(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b)
$$

Each hidden state therefore summarizes all the inputs seen so far, which is what lets the network model sequences.
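The recurrence above can be sketched in a few lines of NumPy. This is an illustrative forward pass only, with no training; the function name `rnn_forward`, the toy dimensions, and the random weights are assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b, h0=None):
    """Return the hidden states h_1..h_T for inputs x_1..x_T."""
    hidden_size = W_hh.shape[0]
    h = np.zeros(hidden_size) if h0 is None else h0  # h_0 defaults to zeros
    states = []
    for x_t in xs:
        # h_t = tanh(W_xh . x_t + W_hh . h_{t-1} + b)
        h = np.tanh(W_xh @ x_t + W_hh @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(42)
input_size, hidden_size, T = 3, 4, 6   # hypothetical toy dimensions
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)
xs = rng.normal(size=(T, input_size))

states = rnn_forward(xs, W_xh, W_hh, b)
print(states.shape)  # one hidden vector per time step

# Perturb only the first input and compare the last hidden state.
xs_changed = xs.copy()
xs_changed[0] += 1.0
states_changed = rnn_forward(xs_changed, W_xh, W_hh, b)
print(np.abs(states[-1] - states_changed[-1]).max())
```

Changing only $x_1$ still shifts the final hidden state, because every $h_t$ is computed from $h_{t-1}$.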

**Q4. Common issues with RNNs such as vanishing and exploding gradients.**

RNNs often face two major issues:

- ***Vanishing gradients***: Gradients become too small during training, making it hard to learn long-term dependencies.
- ***Exploding gradients***: Gradients become too large, causing unstable training or numerical overflow.

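These problems arise during backpropagation through time: the gradient is multiplied by the recurrent weights (and the activation's derivative) once per time step, so over many steps it shrinks or grows roughly exponentially.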
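A rough numerical illustration of both effects, under a deliberately simplified assumption: a linear recurrence (no `tanh`) whose recurrent matrix is `scale * I`. The helper `gradient_norm_through_time` is hypothetical; it isolates how the size of the recurrent weights drives the gradient norm.

```python
import numpy as np

def gradient_norm_through_time(scale, steps, hidden_size=4):
    """Norm of a gradient after being multiplied by W_hh^T `steps` times.

    Simplifying assumption: a linear recurrence (no tanh) with
    W_hh = scale * I, so only the effect of the weight scale is visible.
    """
    W_hh = scale * np.eye(hidden_size)
    grad = np.ones(hidden_size)
    for _ in range(steps):
        grad = W_hh.T @ grad   # one step of backpropagation through time
    return np.linalg.norm(grad)

for scale in (0.5, 1.0, 1.5):
    norm = gradient_norm_through_time(scale, steps=50)
    print(f"scale={scale}: gradient norm after 50 steps = {norm:.3e}")
```

With a scale below 1 the norm collapses toward zero (vanishing), and with a scale above 1 it blows up (exploding). Real RNNs behave less cleanly because of the nonlinearity, but the mechanism is the same.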

## Task - 2: Dataset Preparation

#### Loading the IMDB dataset from TensorFlow

In [8]:
# Importing suitable packages
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
# Loading the dataset; tokenization is already done by TensorFlow
vocab_size = 10_000
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words = vocab_size)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 ━━━━━━━━━━━━━━━━━━━━ 10s 1us/step


In [4]:
# Shape of the training and test datasets
print(f"Shape of the X_train dataset: {x_train.shape} and Y_train: {y_train.shape}")
print(f"Shape of the X_test dataset: {x_test.shape} and Y_test: {y_test.shape}")

Shape of the X_train dataset: (25000,) and Y_train: (25000,)
Shape of the X_test dataset: (25000,) and Y_test: (25000,)


In [7]:
# First few examples from the train dataset
for i in range(5):
    print(f"{x_train[i]}: {y_train[i]}")

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]: 1
[1, 194, 1153, 194, 8255, 78, 228, 5, 6, 1463, 4369, 5012
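The integer lists above are easier to sanity-check once mapped back to words. The sketch below assumes the default `imdb.load_data()` settings, where ids 0, 1, and 2 are reserved (padding, start-of-sequence, out-of-vocabulary) and real word ids are shifted by 3. The helper `decode_review` and the miniature `toy_word_index` are hypothetical; with the real data, the word index would come from `imdb.get_word_index()`.

```python
def decode_review(encoded_review, word_index, index_from=3):
    """Map integer ids back to words.

    Assumes the default imdb.load_data() convention: ids 0, 1 and 2 are
    reserved for padding, start-of-sequence and out-of-vocabulary, and
    every real word id is shifted by index_from.
    """
    id_to_word = {idx + index_from: word for word, idx in word_index.items()}
    id_to_word.update({0: "<PAD>", 1: "<START>", 2: "<OOV>"})
    return " ".join(id_to_word.get(i, "<OOV>") for i in encoded_review)

# Hypothetical miniature word index, for illustration only.
toy_word_index = {"the": 1, "film": 2, "was": 3, "great": 4}
print(decode_review([1, 4, 5, 6, 2, 7, 0], toy_word_index))
# prints: <START> the film was <OOV> great <PAD>
```

With the real dataset, a call such as `decode_review(x_train[0], imdb.get_word_index())` should yield readable text, with `<OOV>` wherever a word fell outside the `vocab_size` most frequent words.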

In [10]:
# Finding the maximum review length
maxlen_review = max(len(review) for review in x_train)
print(f"Maximum review length in training data: {maxlen_review}")

Maximum review length in training data: 2494


> **Note**: Using the *maximum length* can lead to large input sizes and **slower training**, especially if a few reviews are extremely long.

In [11]:
# Setting maxlen to the 90th percentile of review lengths to capture most data while avoiding very long outliers
review_lengths = [len(review) for review in x_train]
maxlen = int(np.percentile(review_lengths, 90))

print(f"Using maxlen = {maxlen} for padding.")

Using maxlen = 467 for padding.
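To see the trade-off behind the percentile choice, the sketch below reports how many sequences fit entirely within the cutoff at a few percentiles. It uses synthetic, right-skewed lengths as a stand-in for `review_lengths`, so the printed numbers are illustrative and not results from the IMDB data; the `coverage` helper is hypothetical.

```python
import numpy as np

def coverage(lengths, q):
    """Return (cutoff, fraction of sequences that fit without truncation)."""
    lengths = np.asarray(lengths)
    cutoff = int(np.percentile(lengths, q))
    return cutoff, float(np.mean(lengths <= cutoff))

rng = np.random.default_rng(0)
# Synthetic right-skewed lengths standing in for the real review lengths.
toy_lengths = rng.lognormal(mean=5.3, sigma=0.6, size=10_000).astype(int)

for q in (50, 90, 95, 100):
    cutoff, kept_whole = coverage(toy_lengths, q)
    print(f"percentile={q:3d}  cutoff={cutoff:5d}  kept whole={kept_whole:.1%}")
```

On skewed data, moving from the 90th to the 100th percentile typically lengthens the padded input considerably while keeping only a few more reviews intact.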


In [12]:
# Applying the padding on the training and testing data
x_train_padded = pad_sequences(x_train, maxlen = maxlen, padding = "post", truncating = "post")
x_test_padded = pad_sequences(x_test, maxlen = maxlen, padding = "post", truncating = "post")

In [13]:
print(f"Shape of padded x_train: {x_train_padded.shape}")
print(f"Shape of padded x_test: {x_test_padded.shape}")

Shape of padded x_train: (25000, 467)
Shape of padded x_test: (25000, 467)
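For intuition about what `padding = "post"` and `truncating = "post"` do to each review, here is a small NumPy stand-in. `pad_post` is a hypothetical helper written for illustration, not the Keras implementation, and it covers only the "post" case used above.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Toy stand-in for pad_sequences(..., padding="post", truncating="post")."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]              # "post" truncation drops the tail
        out[row, :len(trimmed)] = trimmed   # "post" padding leaves zeros at the end
    return out

toy_reviews = [[1, 14, 22], [1, 194, 1153, 194, 8255, 78]]
print(pad_post(toy_reviews, maxlen=4))
# prints:
# [[   1   14   22    0]
#  [   1  194 1153  194]]
```

Short reviews gain trailing zeros and long reviews lose their final tokens, so every row ends up with exactly `maxlen` entries.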
