This notebook implements a simple sentiment analysis model using a GRU, built from the following components:

Tokenizer: Converts text to sequences of integers based on word frequency.
pad_sequences: Ensures all sequences have the same length by padding or truncating.
Embedding: Maps words to dense vector representations.
GRU: Gated Recurrent Unit, a type of RNN for sequential data.
Dense: Fully connected layer for classification.

In [1]:
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences



sentences: A mix of positive and negative reviews.
labels: Corresponding sentiment labels (1 for positive, 0 for negative).

In [11]:
sentences = [
    "I love this product",
    "This is the worst experience",
    "Absolutely amazing!",
    "Not good at all",
    "I'm very happy with this",
    "Terrible service",
    "Excellent quality",
    "Awful, never buying again",
    "Best purchase ever",
    "Really bad, very disappointed"
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

Tokenizer:

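    Lowercases the text, strips punctuation, and assigns a unique integer to each word. More frequent words get smaller indices, and index 0 is reserved for padding.
    Example (illustrative indices): "I love this product" → [1, 2, 3, 4]. The actual values depend on word frequency across all sentences.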
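
As a rough illustration of the frequency-based indexing described above, here is a pure-Python sketch that mimics the idea without Keras. The helpers `tokenize`, `build_word_index`, and `to_sequence` are hypothetical names, not Keras functions, and the punctuation handling is a simplification of the Tokenizer's own filter list.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of letters and apostrophes (simplified filtering).
    return re.findall(r"[a-z']+", text.lower())

def build_word_index(texts):
    # Count words across the corpus; the most frequent word gets index 1.
    # Index 0 is left unused so it can serve as the padding value.
    counts = Counter(word for text in texts for word in tokenize(text))
    ranked = sorted(counts, key=lambda w: -counts[w])  # stable: ties keep first-seen order
    return {word: i for i, word in enumerate(ranked, start=1)}

def to_sequence(text, word_index):
    # Words that were not seen while building the index are dropped.
    return [word_index[w] for w in tokenize(text) if w in word_index]

corpus = ["I love this product", "This is the worst experience", "Not good at all"]
word_index = build_word_index(corpus)
seq = to_sequence("I love this product", word_index)
print(word_index["this"], seq)  # 1 [2, 3, 1, 4]
```

In this small corpus "this" occurs twice, so it receives index 1 even though it is the third word of the first sentence.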

pad_sequences:

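    Pads or truncates every sequence to the same length (max_length); here max_length is the length of the longest training sequence.
    Example with padding='post': [1, 2, 3] → [1, 2, 3, 0] (zeros are appended at the end).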
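
The padding step can be sketched in plain Python as follows. `pad_post` is a hypothetical helper, not the Keras function. It assumes a positive `maxlen`, pads at the end, and truncates from the front, which matches the default `truncating='pre'` of `pad_sequences`.

```python
def pad_post(sequences, maxlen, value=0):
    # Assumes maxlen >= 1.
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                 # too long: keep the last maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))  # too short: append padding
    return padded

batch = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
result = pad_post(batch, maxlen=4)
print(result)  # [[1, 2, 3, 0], [4, 5, 0, 0], [7, 8, 9, 10]]
```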

In [12]:
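tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)  # build the word index from the training sentences
sequences = tokenizer.texts_to_sequences(sentences)  # words -> integer indices
max_length = max(len(s) for s in sequences)  # longest sentence, in tokens
X = pad_sequences(sequences, maxlen=max_length, padding='post')  # zero-pad at the end
y = np.array(labels)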


Embedding Layer:

    Converts each word into a 16-dimensional dense vector.
    Vocabulary size = len(tokenizer.word_index) + 1 (words + padding token).
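
To make the lookup and the "+ 1" concrete, here is a minimal sketch of an embedding table in plain Python. The vocabulary size of 5 words and the random initial values are assumptions for illustration; in the real layer the rows are learned during training.

```python
import random

num_words = 5                  # hypothetical number of known words (indices 1..5)
vocab_size = num_words + 1     # one extra row for the padding index 0
embedding_dim = 16

rng = random.Random(0)
# One row of embedding_dim floats per integer index.
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

def embed(sequence):
    # Looking up an index simply selects the corresponding row.
    return [embedding_table[i] for i in sequence]

vectors = embed([3, 1, 5, 0])  # a padded sequence of 4 tokens
print(len(vectors), len(vectors[0]))  # 4 16
```

Without the extra row, the largest word index (5 here) would fall outside the table.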

GRU Layer:

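    Reads the embedded sequence one token at a time, updating a hidden state that summarizes the words seen so far.
    The hidden state has 16 units; more units increase the model's capacity, but also the risk of overfitting on a dataset this small.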
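
The sketch below shows one textbook formulation of the GRU update in NumPy, to illustrate how the hidden state is carried across time steps. It is not the Keras implementation: Keras differs in details (for example its default `reset_after=True` variant and its weight layout), and the parameter names and random weights here are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, p):
    # Update gate: how much of the old state to keep.
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])
    # Reset gate: how much of the old state feeds the candidate.
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])
    # Candidate state computed from the input and the reset-scaled old state.
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
    # Blend old state and candidate.
    return z * h + (1.0 - z) * h_cand

input_dim, units = 16, 16
rng = np.random.default_rng(0)
params = {}
for gate in "zrh":
    params["W" + gate] = rng.normal(scale=0.1, size=(units, input_dim))
    params["U" + gate] = rng.normal(scale=0.1, size=(units, units))
    params["b" + gate] = np.zeros(units)

sequence = rng.normal(size=(5, input_dim))  # 5 time steps of 16-dim embeddings
h = np.zeros(units)
for x in sequence:
    h = gru_step(x, h, params)
print(h.shape)  # (16,)
```

Only the final hidden state is passed on, which corresponds to using a GRU layer without returning the full sequence.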

Dense Layer:

    Outputs a single value between 0 and 1 (sigmoid activation).
    Represents the probability of the review being positive.
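
As a small worked example of the output layer, the following sketch computes a weighted sum of a hidden state and squashes it with a sigmoid. The 3-unit hidden state and the weights are made-up values for illustration (the model above uses 16 units).

```python
import math

def dense_sigmoid(h, weights, bias):
    # Weighted sum of the hidden state, mapped into (0, 1) by the sigmoid.
    logit = sum(w * x for w, x in zip(weights, h)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

h = [0.2, -0.1, 0.4]          # hypothetical hidden state
weights = [1.0, 0.5, -0.25]   # hypothetical weights
p = dense_sigmoid(h, weights, bias=0.0)
label = "Positive" if p > 0.5 else "Negative"
print(round(p, 4), label)  # 0.5125 Positive
```

A logit of exactly 0 gives a probability of 0.5, which sits on the decision boundary.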

In [13]:
model = Sequential([
    # input_length is omitted: it is deprecated in Keras 3, and the sequence length is inferred from the data.
    Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=16),
    GRU(16),
    Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])


    Loss Function: binary_crossentropy, the standard loss for binary classification.
    Optimizer: adam, which adapts the learning rate for each parameter.
    Training: Runs for 10 epochs with batch size 2, i.e. 5 batches per epoch for the 10 sentences.
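
For reference, binary cross-entropy can be computed by hand as in the sketch below. The function is a plain-Python illustration, not the Keras implementation, and the clipping constant `eps` is an assumption. A useful baseline: a model that outputs 0.5 for every sample scores ln 2 ≈ 0.693, so a loss that stays close to 0.69 suggests predictions near chance.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]
chance = binary_crossentropy(y_true, [0.5, 0.5, 0.5, 0.5])
confident = binary_crossentropy(y_true, [0.9, 0.1, 0.9, 0.1])
print(round(chance, 4), round(confident, 4))  # 0.6931 0.1054
```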

In [14]:
model.fit(X, y, epochs=10, batch_size=2, verbose=1)

Epoch 1/10
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 4ms/step - accuracy: 0.4667 - loss: 0.6935
Epoch 2/10
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - accuracy: 0.5000 - loss: 0.6900 
Epoch 3/10
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step - accuracy: 0.7014 - loss: 0.6837 
Epoch 4/10
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - accuracy: 0.6736 - loss: 0.6799 
Epoch 5/10
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - accuracy: 0.3806 - loss: 0.6876     
Epoch 6/10
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - accuracy: 1.0000 - loss: 0.6766 
Epoch 7/10
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 9ms/step - accuracy: 1.0000 - loss: 0.6732 
Epoch 8/10
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 7ms/step - accuracy: 1.0000 - loss: 0.6651 
Epoch 9/10
[1m5/5[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m

<keras.src.callbacks.history.History at 0x72fc7413adb0>

Tokenizing and Padding Test Sentences:

    Converts new reviews to sequences.
    Pads to match the training input size.

Making Predictions:

    Outputs a probability for each review being positive.
    Converts probabilities to sentiments: >0.5 → "Positive", <=0.5 → "Negative".

In [15]:
test_sentences = [
    "I hate this product",
    "The service was excellent",
    "Not worth the money",
    "Absolutely fantastic experience!"
]
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, maxlen=max_length, padding='post')
predictions = model.predict(test_padded)


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 354ms/step


In [16]:
for i, sentence in enumerate(test_sentences):
    sentiment = "Positive" if predictions[i][0] > 0.5 else "Negative"
    print(f"Review: {sentence}")
    print(f"Predicted Sentiment: {sentiment}")


Review: I hate this product
Predicted Sentiment: Positive
Review: The service was excellent
Predicted Sentiment: Positive
Review: Not worth the money
Predicted Sentiment: Positive
Review: Absolutely fantastic experience!
Predicted Sentiment: Positive
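
One plausible contributor to the uniform "Positive" predictions above is that a Tokenizer created without an `oov_token` drops words it did not see during fitting, so a decisive word such as "hate" never reaches the model. The sketch below checks which test words are missing from the training vocabulary. It uses a simplified regex tokenizer (an assumption, not the Keras one) and does not prove this is the only cause.

```python
import re

train_sentences = [
    "I love this product", "This is the worst experience", "Absolutely amazing!",
    "Not good at all", "I'm very happy with this", "Terrible service",
    "Excellent quality", "Awful, never buying again", "Best purchase ever",
    "Really bad, very disappointed",
]
test_sentences = [
    "I hate this product", "The service was excellent",
    "Not worth the money", "Absolutely fantastic experience!",
]

def tokenize(text):
    # Lowercase and keep runs of letters and apostrophes (simplified filtering).
    return re.findall(r"[a-z']+", text.lower())

vocab = {w for s in train_sentences for w in tokenize(s)}

# For each test sentence, list the words the model never gets to see.
dropped_by_sentence = {
    s: [w for w in tokenize(s) if w not in vocab] for s in test_sentences
}
for sentence, dropped in dropped_by_sentence.items():
    print(f"{sentence!r}: dropped {dropped}")
```

Every test sentence loses at least one word, and in each case it is the word (or words) carrying most of the sentiment that the training set did not cover.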


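All four test reviews are predicted "Positive", including two that are clearly negative, so ten training sentences are not enough for the model to learn a useful decision rule. Possible improvements:

Add more data: Improve accuracy by training on a larger dataset.
Tune hyperparameters: Adjust GRU units, embedding_dim, batch_size, and epochs.
Regularize: Add dropout to reduce overfitting.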
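
To show what dropout does, here is a minimal pure-Python sketch of "inverted" dropout: during training each activation is zeroed with probability `rate`, and the survivors are scaled up so the expected value stays the same. The helper name and values are assumptions for illustration. In Keras the same idea is available through the `Dropout` layer or the `dropout` and `recurrent_dropout` arguments of `GRU`.

```python
import random

def dropout(values, rate, rng, training=True):
    # At inference time (training=False) the values pass through unchanged.
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    # Zero each value with probability `rate`; rescale the rest by 1 / keep.
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(42)
activations = [1.0] * 1000
dropped = dropout(activations, rate=0.2, rng=rng)
kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)
print(sorted(set(dropped)), kept_fraction)
```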