# Practical 7: Sentiment Analysis
### Instructor:  Dr. Maryam Movahedifar

<div style="display: flex; justify-content: space-between; align-items: center;">
  <span style="display: flex; align-items: center;">
    <b>Applied Text Mining - University of Bremen - Data Science Center</b>
  </span>
  <div style="display: flex; align-items: center; margin-left: auto;">
    <img src="Uni_Logo.png" alt="Uni Logo" style="width: 100px; margin-right: 10px;">
    <img src="DSC_Logo.png" alt="DSC Logo" style="width: 150px;">
  </div>
</div>

In this practical, we will apply both dictionary-based and deep learning-based sentiment analysis to the IMDB sentiment classification task.

We are going to use the following libraries. Take care to have them installed!


In [1]:
!rm -rf ~/.cache

In [2]:
# Install packages (run once, outside Python script or in a cell)
!pip install --upgrade -q numpy scipy tensorflow tensorflow-hub tensorflow-datasets scikit-learn vaderSentiment pandas scikeras keras matplotlib

In [3]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import fetch_20newsgroups
from sklearn.preprocessing import LabelEncoder

# Use only TensorFlow Keras imports (do NOT mix with standalone keras)
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers, utils
from tensorflow.keras.utils import to_categorical

from scikeras.wrappers import KerasClassifier



# Let's get started! 

Here we are going to classify movie reviews as positive or negative using the text of the review. We will use the IMDB dataset that contains the text of 50,000 movie reviews from the Internet Movie Database (IMDb). These are split into 25,000 reviews for training and 25,000 reviews for testing. The training and test sets are balanced, meaning they contain an equal number of positive and negative reviews.

1. **The IMDB dataset is available in TensorFlow Datasets. Use the following code to download it.**


In [4]:
# Split the training set into 60% and 40% to end up with 15,000 examples
# for training, 10,000 examples for validation and 25,000 examples for testing.
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews",
    split=('train[:60%]', 'train[60%:]', 'test'),
    as_supervised=True)



2. **Use the following code to explore the data and print the first 4 examples.**

In [5]:
train_examples_batch, train_labels_batch = next(iter(train_data.batch(4)))
train_examples_batch



<tf.Tensor: shape=(4,), dtype=string, numpy=
array([b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.",
       b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell a

In [6]:
train_labels_batch

<tf.Tensor: shape=(4,), dtype=int64, numpy=array([0, 0, 0, 1])>

The label is an integer value of either 0 or 1, where 0 is a negative review, and 1 is a positive review.

## **Lexicon-based sentiment analysis**

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media and also works well on texts from other domains.

The [VADER lexicon](https://www.kaggle.com/datasets/nltkdata/vader-lexicon) is empirically validated by multiple independent human judges. VADER incorporates a "gold-standard" sentiment lexicon that is especially attuned to microblog-like contexts.

### Advantages:
- Unsupervised  
- Fast and deployable  
- Reasonable performance even without preprocessing  

### Disadvantages:
- It is a rule-based approach, so it relies on a fixed list of predefined polarity scores for each word  
- Its performance is limited compared to state-of-the-art NLP approaches  


3. **Create a VADER analyzer using the `SentimentIntensityAnalyzer` class, and look at the polarity scores of some example sentences.**

In [7]:
analyzer = SentimentIntensityAnalyzer()
print(analyzer.polarity_scores("you cannot be negative"))

{'neg': 0.0, 'neu': 0.5, 'pos': 0.5, 'compound': 0.4585}


The output is 50% positive and 50% neutral. The compound (overall sentiment) score is 0.4585.
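To see where a compound value like this comes from, here is a minimal sketch of the normalization step that VADER applies to the summed word valences, `x / sqrt(x^2 + alpha)` with `alpha = 15`. The function name and the input value 1.998 are chosen here for illustration (the input is picked so that the result lands close to the output above). The lexicon lookup and the negation rules that produce the summed valence are not shown.

```python
import math

def normalize_valence(total_valence: float, alpha: float = 15.0) -> float:
    """Squash an unbounded summed valence into the range [-1, 1]."""
    norm = total_valence / math.sqrt(total_valence ** 2 + alpha)
    return max(-1.0, min(1.0, norm))

# Hypothetical summed valence, chosen to land near the compound score above
print(round(normalize_valence(1.998), 4))

# Larger sums approach, but never exceed, +1 or -1
print(round(normalize_valence(10.0), 4), round(normalize_valence(-10.0), 4))
```

Because of this squashing, the compound score is best read as a bounded summary of the overall polarity rather than as a probability.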

4. **Calculate the compound sentiment scores of the first 1,000 training examples. Convert the final scores to 0 (negative) and 1 (positive).**

In [8]:
train_examples_batch, train_labels_batch = next(iter(train_data.batch(1000)))



In [9]:
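# Predict 1 (positive) when the compound score is above 0, otherwise 0 (negative)
texts = train_examples_batch.numpy()  # convert the tensor once, outside the loop
score = [0] * 1000
for i in range(1000):
    text = texts[i].decode("utf-8")
    sent = analyzer.polarity_scores(text)['compound']
    if sent > 0:
        score[i] = 1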
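
The thresholding step can also be looked at in isolation. The helper name `compound_to_label` and the example scores below are made up for illustration. The cell above uses a cut-off of 0; the VADER documentation suggests ±0.05 as typical thresholds when a separate neutral class is wanted, which this binary task does not have.

```python
def compound_to_label(compound: float, threshold: float = 0.0) -> int:
    """Return 1 (positive) if the compound score exceeds the threshold, else 0 (negative)."""
    return 1 if compound > threshold else 0

example_scores = [0.92, -0.40, 0.0, 0.03]  # hypothetical compound scores

print([compound_to_label(s) for s in example_scores])                  # [1, 0, 0, 1]
print([compound_to_label(s, threshold=0.05) for s in example_scores])  # [1, 0, 0, 0]
```

Note that a score of exactly 0 is mapped to the negative class, so reviews in which VADER finds no sentiment-bearing words count as negative.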

5. **Evaluate the performance of the predicted sentiment scores using the `classification_report` function. How do you interpret your results?**

In [10]:
print(metrics.classification_report(train_labels_batch, score, target_names=['negative', 'positive']))

              precision    recall  f1-score   support

    negative       0.78      0.53      0.63       490
    positive       0.66      0.85      0.74       510

    accuracy                           0.70      1000
   macro avg       0.72      0.69      0.69      1000
weighted avg       0.71      0.70      0.69      1000
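
To make the columns of the report concrete, the following sketch computes precision, recall and F1 for one class from scratch. The function name and the toy labels are hypothetical and are not taken from the IMDB data.

```python
def precision_recall_f1(y_true, y_pred, target_class):
    """Compute precision, recall and F1 for one class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target_class and p == target_class)
    fp = sum(1 for t, p in pairs if t != target_class and p == target_class)
    fn = sum(1 for t, p in pairs if t == target_class and p != target_class)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels (made up): 0 = negative, 1 = positive
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 0]

for cls, name in [(0, "negative"), (1, "positive")]:
    p, r, f = precision_recall_f1(y_true, y_pred, cls)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In the report above, the negative class combines a fairly high precision (0.78) with a low recall (0.53): when VADER predicts "negative" it is usually right, but it labels many negative reviews as positive.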



## **Deep learning-based sentiment analysis**

In this part of the practical, we are going to use pre-trained word embedding models from [TensorFlow Hub](https://tfhub.dev/) to do sentiment classification on movie reviews. TensorFlow Hub is a repository of trained machine learning models.

6. **Use a pre-trained model from TensorFlow Hub called `"google/nnlm-en-dim50/2"`**

In [11]:
# Token based text embedding trained on English Google News 7B corpus.
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"

hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True)
hub_layer(train_examples_batch[:3])

<tf.Tensor: shape=(3, 50), dtype=float32, numpy=
array([[ 0.5423195 , -0.0119017 ,  0.06337538,  0.06862972, -0.16776837,
        -0.10581174,  0.16865303, -0.04998824, -0.31148055,  0.07910346,
         0.15442263,  0.01488662,  0.03930153,  0.19772711, -0.12215476,
        -0.04120981, -0.2704109 , -0.21922152,  0.26517662, -0.80739075,
         0.25833532, -0.3100421 ,  0.28683215,  0.1943387 , -0.29036492,
         0.03862849, -0.7844411 , -0.0479324 ,  0.4110299 , -0.36388892,
        -0.58034706,  0.30269456,  0.3630897 , -0.15227164, -0.44391504,
         0.19462997,  0.19528408,  0.05666234,  0.2890704 , -0.28468323,
        -0.00531206,  0.0571938 , -0.3201318 , -0.04418665, -0.08550783,
        -0.55847436, -0.23336391, -0.20782952, -0.03543064, -0.17533456],
       [ 0.56338924, -0.12339553, -0.10862679,  0.7753425 , -0.07667089,
        -0.15752277,  0.01872335, -0.08169781, -0.3521876 ,  0.4637341 ,
        -0.08492756,  0.07166859, -0.00670817,  0.12686075, -0.19326553,
 

Here you see that no matter the length of the input text, the output shape of the embeddings is: `(num_examples, embedding_dimension)`.
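
One simple way to obtain a fixed-size vector from a text of any length is to average its token vectors. The sketch below illustrates the idea with a made-up 3-dimensional embedding table. It is not the actual NNLM model, whose vocabulary, vector values and combination rule differ.

```python
# Hypothetical 3-dimensional token embeddings (the real model uses 50 dimensions)
toy_embeddings = {
    "great": [0.9, 0.1, 0.0],
    "movie": [0.2, 0.5, 0.3],
    "terrible": [-0.8, 0.0, 0.2],
}
unknown = [0.0, 0.0, 0.0]  # fallback vector for out-of-vocabulary tokens

def embed_text(text):
    """Average the token vectors so every text maps to a vector of the same size."""
    tokens = text.lower().split()
    vectors = [toy_embeddings.get(tok, unknown) for tok in tokens]
    dim = len(unknown)
    if not vectors:
        return [0.0] * dim
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]

texts = ["great movie", "a terrible terrible movie indeed"]
batch = [embed_text(t) for t in texts]

# Two texts of different lengths, both mapped to 3 numbers
print(len(batch), len(batch[0]))
```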

7. **Build a deep learning model using the embedding layer and one hidden layer.**

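Use a `Lambda` layer to wrap the TensorFlow Hub model so that it can be used in the Sequential API.

This does not change the output of the hub model; it is just a wrapper that integrates it as a Keras layer. Note that Keras generally does not track variables used inside a `Lambda` layer, so the embedding weights may not be updated during training even though `trainable=True` is set.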


In [17]:
model = tf.keras.Sequential([
    tf.keras.layers.Lambda(lambda x: hub_layer(x)),  # Wrap hub layer here
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Build the model by passing a sample batch of string inputs (required for summary to work)
sample_text = tf.constant(["This is a sample input."])
model(sample_text)

model.summary()
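
As a sanity check on the summary, the number of parameters in the two `Dense` layers can be worked out by hand: a dense layer has `inputs * units` weights plus `units` biases. The helper below is only an illustration, and the embedding weights inside the hub layer are not counted here.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

hidden = dense_params(50, 64)  # 50-dim embedding -> 64 hidden units
output = dense_params(64, 1)   # 64 hidden units -> 1 sigmoid output

print(hidden, output, hidden + output)  # 3264 65 3329
```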

8. **Compile and train the model for 10 epochs in batches of 512 samples.**

In [18]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
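
For reference, binary cross-entropy for a single example is `-(y*log(p) + (1-y)*log(1-p))`, averaged over the batch. A minimal pure-Python sketch follows; the function name and the example probabilities are illustrative, not part of the Keras API.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a batch of labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]

# Always predicting 0.5 gives ln(2), roughly 0.6931
print(round(binary_crossentropy(labels, [0.5, 0.5, 0.5, 0.5]), 4))

# Confident, mostly correct predictions give a much lower loss
print(round(binary_crossentropy(labels, [0.9, 0.1, 0.8, 0.3]), 4))
```

A loss close to 0.693 therefore means the model is doing no better than guessing, which is a useful baseline when reading training logs.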

In [19]:
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=10,
                    validation_data=validation_data.batch(512),
                    verbose=1)

Epoch 1/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 6s 131ms/step - accuracy: 0.5346 - loss: 0.6934 - val_accuracy: 0.6686 - val_loss: 0.6429
Epoch 2/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 3s 73ms/step - accuracy: 0.6784 - loss: 0.6307 - val_accuracy: 0.7062 - val_loss: 0.5951
Epoch 3/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 2s 68ms/step - accuracy: 0.7135 - loss: 0.5842 - val_accuracy: 0.7219 - val_loss: 0.5655
Epoch 4/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 2s 70ms/step - accuracy: 0.7262 - loss: 0.5617 - val_accuracy: 0.7331 - val_loss: 0.5452
Epoch 5/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 2s 65ms/step - accuracy: 0.7360 - loss: 0.5427 - val_accuracy: 0.7411 - val_loss: 0.5333
Epoch 6/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 2s 63ms/step - accuracy: 0.7388 - loss: 0.5337 - val_accuracy: 0.7440 - val_loss: 0.5273
Epoch 7/10
30/30 ━━━

9. **Evaluate the model on the test set.**

In [20]:
results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))

49/49 - 5s - 95ms/step - accuracy: 0.7416 - loss: 0.5230
loss: 0.523
compile_metrics: 0.742


This fairly simple approach achieves an accuracy of about 74%.
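
The step counts shown in the training and evaluation output follow directly from the split sizes and the batch size. A quick check, using only the numbers stated earlier (25,000 training reviews split 60/40, 25,000 test reviews, batches of 512):

```python
import math

n_train_full, n_test, batch_size = 25_000, 25_000, 512

n_train = n_train_full * 60 // 100  # train[:60%] -> 15,000 examples
n_val = n_train_full - n_train      # train[60%:] -> 10,000 examples

# The last, smaller batch still counts as a step, hence the ceiling
print(math.ceil(n_train / batch_size))  # 30 training steps per epoch
print(math.ceil(n_val / batch_size))    # 20 validation steps
print(math.ceil(n_test / batch_size))   # 49 evaluation steps
```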

10. **For your next experiment, load a more complex pretrained word embedding for the embedding layer. Train and evaluate your model.**

In [21]:
embedding = "https://tfhub.dev/google/nnlm-en-dim128-with-normalization/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[],
                           dtype=tf.string, trainable=True)
# hub_layer(train_examples_batch[:3])

Here we tried `google/nnlm-en-dim128-with-normalization/2`, which was trained with the same NNLM (Neural Network Language Model) architecture on the same data as `google/nnlm-en-dim50/2` but has a larger embedding dimension. Higher-dimensional embeddings can improve performance on your task, but the model may take longer to train. This model also applies additional text normalization, such as removing punctuation, which can help if the text in your task contains extra characters or punctuation. You can try other pretrained embeddings from [TensorFlow Hub](https://www.kaggle.com/models?tfhub-redirect=true), for example BERT, but remember that these are much larger models and need considerably more training time.




In [22]:
model = tf.keras.Sequential([
    tf.keras.layers.Lambda(lambda x: hub_layer(x)),  # Wrap hub layer here
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Build the model by passing a sample batch of string inputs (required for summary to work)
sample_text = tf.constant(["This is a sample input."])
model(sample_text)

model.summary()

In [23]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [24]:
history = model.fit(train_data.shuffle(10000).batch(512),
                    epochs=10,
                    validation_data=validation_data.batch(512),
                    verbose=1)

Epoch 1/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 4s 89ms/step - accuracy: 0.5728 - loss: 0.6788 - val_accuracy: 0.7341 - val_loss: 0.5907
Epoch 2/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 3s 82ms/step - accuracy: 0.7469 - loss: 0.5675 - val_accuracy: 0.7755 - val_loss: 0.5121
Epoch 3/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 3s 84ms/step - accuracy: 0.7870 - loss: 0.4940 - val_accuracy: 0.7914 - val_loss: 0.4678
Epoch 4/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 3s 79ms/step - accuracy: 0.7937 - loss: 0.4611 - val_accuracy: 0.8063 - val_loss: 0.4410
Epoch 5/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 3s 82ms/step - accuracy: 0.8054 - loss: 0.4278 - val_accuracy: 0.8104 - val_loss: 0.4287
Epoch 6/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 3s 80ms/step - accuracy: 0.8089 - loss: 0.4243 - val_accuracy: 0.8155 - val_loss: 0.4191
Epoch 7/10
30/30 ━━━━

In [25]:
results = model.evaluate(test_data.batch(512), verbose=2)

for name, value in zip(model.metrics_names, results):
  print("%s: %.3f" % (name, value))

49/49 - 3s - 51ms/step - accuracy: 0.8176 - loss: 0.4070
loss: 0.407
compile_metrics: 0.818


11. **For your next experiment, use a `TextVectorization` layer to preprocess raw text into integer sequences and, instead of a pretrained TensorFlow Hub embedding (such as `nnlm-en-dim128-with-normalization`), train a custom `Embedding` layer from scratch. Then train and evaluate your model on the same dataset. Compare the performance and training time with the previous approaches that used pretrained embeddings.**

In [26]:
# Step 1: Create TextVectorization layer
max_features = 10000          # Vocabulary size limit
sequence_length = 250         # Max tokens per review (adjustable)

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length
)

In [27]:
# Step 2: Adapt vectorize_layer on the training text dataset (extract only texts)
train_text_ds = train_data.map(lambda text, label: text)
vectorize_layer.adapt(train_text_ds)
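
Conceptually, the layer lowercases the text, strips punctuation, splits on whitespace, and maps each token to an integer index, padding or truncating to a fixed length. The pure-Python sketch below mimics that behaviour under simplifying assumptions: index 0 is padding, index 1 is the out-of-vocabulary token, and the remaining indices are assigned by descending frequency. The function names and the tiny corpus are invented for illustration, and tie-breaking between equally frequent tokens may differ from Keras.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and strip punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    """Most frequent tokens first; slots 0 and 1 are reserved for padding and OOV."""
    counts = Counter(tok for t in texts for tok in standardize(t).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return {tok: idx + 2 for idx, tok in enumerate(most_common)}

def vectorize(text, vocab, sequence_length):
    """Map tokens to ids, then truncate or right-pad with zeros."""
    ids = [vocab.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

corpus = ["A great movie!", "A terrible movie.", "Great acting, great story."]
vocab = build_vocab(corpus, max_tokens=10)

# "unseen" is not in the vocabulary, so it maps to the OOV index 1
print(vectorize("A great unseen movie", vocab, sequence_length=6))
```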



In [28]:
# Step 3: Build the model
model = tf.keras.Sequential([
    vectorize_layer,                                    # Convert text to int sequences
    tf.keras.layers.Embedding(                          # Learn 50-dim embeddings
        input_dim=max_features,                         # (input_length is deprecated in recent Keras, so it is omitted)
        output_dim=50
    ),
    tf.keras.layers.GlobalAveragePooling1D(),           # Average embeddings into one vector
    tf.keras.layers.Dense(16, activation='relu'),       # Hidden layer
    tf.keras.layers.Dense(1, activation='sigmoid')      # Output layer (binary classification)
])

# Build the model by passing a sample batch of string inputs (required for summary to work)
sample_text = tf.constant(["This is a sample input."])
model(sample_text)

# Now print the summary with proper output shapes and param counts
model.summary()
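
`GlobalAveragePooling1D` averages the embedding vectors over the sequence dimension. Because the `Embedding` layer in the model above is not configured to produce a mask, padded positions are included in that average. The sketch below contrasts the two behaviours on a toy sequence; the vectors and the helper name are hypothetical.

```python
def average_pool(vectors, mask=None):
    """Average equal-length vectors, optionally ignoring masked-out positions."""
    if mask is None:
        mask = [True] * len(vectors)
    kept = [vec for vec, keep in zip(vectors, mask) if keep]
    dim = len(vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Toy sequence: two real tokens followed by two padding positions.
# The padding id also has an embedding vector (here a made-up one).
pad_vec = [0.1, -0.1]
sequence = [[1.0, 0.0], [0.0, 1.0], pad_vec, pad_vec]
is_token = [True, True, False, False]

print(average_pool(sequence))            # padding included, as in the model above
print(average_pool(sequence, is_token))  # padding ignored
```

With long padded tails, the unmasked average is pulled towards the padding vector, which is one reason the choice of `sequence_length` can matter for short reviews.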




In [29]:
# Step 4: Compile model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

In [30]:
# Step 5: Prepare batched datasets for training and validation
batch_size = 512

train_data_batched = train_data.shuffle(10000).batch(batch_size).prefetch(tf.data.AUTOTUNE)
validation_data_batched = validation_data.batch(batch_size).prefetch(tf.data.AUTOTUNE)

# Step 6: Train the model
history = model.fit(
    train_data_batched,
    epochs=10,
    validation_data=validation_data_batched,
    verbose=1
)

Epoch 1/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 16s 480ms/step - accuracy: 0.5542 - loss: 0.6909 - val_accuracy: 0.6981 - val_loss: 0.6787
Epoch 2/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 15s 500ms/step - accuracy: 0.7052 - loss: 0.6708 - val_accuracy: 0.7368 - val_loss: 0.6445
Epoch 3/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 18s 599ms/step - accuracy: 0.7352 - loss: 0.6289 - val_accuracy: 0.7302 - val_loss: 0.5931
Epoch 4/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 16s 502ms/step - accuracy: 0.7780 - loss: 0.5666 - val_accuracy: 0.8030 - val_loss: 0.5239
Epoch 5/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 16s 513ms/step - accuracy: 0.8205 - loss: 0.4967 - val_accuracy: 0.8247 - val_loss: 0.4670
Epoch 6/10
30/30 ━━━━━━━━━━━━━━━━━━━━ 21s 684ms/step - accuracy: 0.8455 - loss: 0.4302 - val_accuracy: 0.8374 - val_loss: 0.4217
Epoch 7/10
30/30

In [31]:
# Evaluate on test data
test_data_batched = test_data.batch(batch_size).prefetch(tf.data.AUTOTUNE)
test_loss, test_acc = model.evaluate(test_data_batched)
print(f"Test accuracy: {test_acc:.3f}")

49/49 ━━━━━━━━━━━━━━━━━━━━ 15s 300ms/step - accuracy: 0.8519 - loss: 0.3523
Test accuracy: 0.851
