# Chapter 9: Natural language processing with TensorFlow: Sentiment analysis

This notebook reproduces the code and summarizes the theoretical concepts from Chapter 9 of *'TensorFlow in Action'* by Thushan Ganegedara.

This chapter builds a sentiment analyzer for video game reviews. The key topics covered are:
1.  **Text Preprocessing**: Cleaning and preparing raw text data for a model using NLTK.
2.  **Data Analysis**: Analyzing vocabulary size and sequence length to inform model design.
3.  **Data Pipelines**: Using Keras `Tokenizer` and the `tf.data` API to create an efficient pipeline for text.
4.  **Sequential Modeling**: Implementing a **Long Short-Term Memory (LSTM)** network to classify sentiment.
5.  **Word Embeddings**: Improving the model by replacing one-hot encoding with a trainable `Embedding` layer.

---

## 9.1 What the text? Exploring and processing text

The first step is to load our data and perform **Exploratory Data Analysis (EDA)**. We will be using the "Video Games 5-core" dataset from Amazon reviews. This dataset is in JSON format.

We will perform the following preprocessing steps:
1.  Load the data using `pandas`.
2.  Filter for **verified reviews** only, to ensure data quality.
3.  Map the 1-5 star ratings (`overall`) to a binary sentiment label: `1` (positive) for 4-5 stars, and `0` (negative) for 1-3 stars.
4.  Observe the **class imbalance** (many more positive reviews than negative ones).

In [15]:
import os
import requests
import gzip
import shutil
import pandas as pd
import numpy as np
import tensorflow as tf
import re
import string
from collections import Counter

# Set a random seed for reproducibility
random_seed = 42
np.random.seed(random_seed)
tf.random.set_seed(random_seed)

# --- 1. Download and Extract Data (Listing 9.1) ---
data_dir = 'data'
json_gz_path = os.path.join(data_dir, 'Video_Games_5.json.gz')
json_path = os.path.join(data_dir, 'Video_Games_5.json')

if not os.path.exists(json_path):
    os.makedirs(data_dir, exist_ok=True)
    if not os.path.exists(json_gz_path):
        print("Downloading Video_Games_5.json.gz (50MB)...")
        url = "https://gitlab.eecs.wsu.edu/2018080000/475datascience/-/raw/master/Video_Games_5.json.gz"
        r = requests.get(url)
        with open(json_gz_path, 'wb') as f:
            f.write(r.content)
        print("Download complete.")

    print("Extracting data...")
    with gzip.open(json_gz_path, 'rb') as f_in:
        with open(json_path, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
    print("Extraction complete.")
else:
    print("Data already downloaded and extracted.")

# --- 2. Load and Pre-filter Data ---
review_df = pd.read_json(json_path, lines=True, orient='records')
review_df = review_df[["overall", "verified", "reviewTime", "reviewText"]]

# Remove records with empty reviewText
review_df = review_df[~review_df["reviewText"].isna()]
review_df = review_df[review_df["reviewText"].str.strip().str.len() > 0]

# Filter for verified reviews only
verified_df = review_df.loc[review_df["verified"], :].copy()

# --- 3. Map to Binary Labels ---
verified_df["label"] = verified_df["overall"].map({5: 1, 4: 1, 3: 0, 2: 0, 1: 0})

# --- 4. Observe Class Imbalance ---
print("\nClass distribution:")
print(verified_df["label"].value_counts())

# Shuffle and separate inputs and labels
verified_df = verified_df.sample(frac=1.0, random_state=random_seed)
inputs, labels = verified_df["reviewText"], verified_df["label"]

Data already downloaded and extracted.

Class distribution:
label
1    277213
0     55291
Name: count, dtype: int64


### Text Cleaning with NLTK

Raw text is very noisy. We will use the **Natural Language Toolkit (NLTK)** library to clean it. Our `clean_text` function will perform several key operations:

1.  **Lowercase**: Converts all text to lowercase.
2.  **Contraction Expansion**: Expands `n't` to ` not ` (crucial for sentiment).
3.  **Tokenization**: Splits the text into a list of individual words (tokens).
4.  **Stop Word Removal**: Removes common, uninformative words like "the", "a", "is". We explicitly *keep* "not" and "no" as they are vital for sentiment.
5.  **Punctuation/Number Removal**: Removes all punctuation and digits.
6.  **Lemmatization**: Reduces words to their base or dictionary form (e.g., "walking" -> "walk", "was" -> "be"). This requires Part-of-Speech (PoS) tagging to know if a word is a noun, verb, etc.

In [16]:
import nltk

# Create NLTK directory if it doesn't exist
nltk_dir = os.path.abspath('nltk')
os.makedirs(nltk_dir, exist_ok=True) # Ensure directory exists

# Append to NLTK data path so NLTK knows where to look for downloaded data
nltk.data.path.append(nltk_dir)

# Download NLTK resources needed for tokenizing, PoS tagging, and lemmatizing
# NLTK's download function intelligently skips already downloaded resources.
nltk.download('averaged_perceptron_tagger', download_dir='nltk')
nltk.download('averaged_perceptron_tagger_eng', download_dir='nltk') # Added specific tagger
nltk.download('wordnet', download_dir='nltk')
nltk.download('omw-1.4', download_dir='nltk')
nltk.download('stopwords', download_dir='nltk')
nltk.download('punkt', download_dir='nltk')
nltk.download('punkt_tab', download_dir='nltk')

from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

# Keep 'not' and 'no' as they are important for sentiment
EN_STOPWORDS = set(stopwords.words('english')) - {'not', 'no'}
lemmatizer = WordNetLemmatizer()

# Define the text cleaning function (based on Listing 9.2)
def clean_text(doc):
    """Cleans a given text document."""

    doc = doc.lower()
    doc = doc.replace("n't ", ' not ')
    doc = re.sub(r"(?:'ll |'re |'d |'ve )", " ", doc)
    doc = re.sub(r"\d+", "", doc)

    # Tokenize and remove stopwords/punctuation
    tokens = [w for w in word_tokenize(doc)
              if w not in EN_STOPWORDS and w not in string.punctuation]

    # Get Part-of-Speech tags
    pos_tags = nltk.pos_tag(tokens)

    # Lemmatize based on PoS tag (only for Nouns and Verbs)
    clean_text = [
        lemmatizer.lemmatize(w, pos=p[0].lower()) if p[0] in ['N', 'V'] else w
        for (w, p) in pos_tags
    ]
    return clean_text

# Test the function
sample_doc = "It\u2019s an okay game. I am always dying, which depresses me. I can't play this."
print(f"Original: {sample_doc}")
print(f"Cleaned: {clean_text(sample_doc)}")

Original: It’s an okay game. I am always dying, which depresses me. I can't play this.
Cleaned: ['’', 'okay', 'game', 'always', 'die', 'depresses', 'ca', 'not', 'play']


[nltk_data] Downloading package averaged_perceptron_tagger to nltk...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     nltk...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to nltk...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to nltk...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to nltk...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to nltk...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to nltk...
[nltk_data]   Package punkt_tab is already up-to-date!


In [17]:
# This step can take a long time (up to an hour)
cleaned_inputs_path = os.path.join(data_dir, 'sentiment_inputs.pkl')
labels_path = os.path.join(data_dir, 'sentiment_labels.pkl')

if not os.path.exists(cleaned_inputs_path):
    print("Cleaning all review texts... (This may take a long time)")
    inputs = inputs.apply(lambda x: clean_text(x))
    # Save the cleaned data to avoid re-running this step
    inputs.to_pickle(cleaned_inputs_path)
    labels.to_pickle(labels_path)
    print("Cleaning and saving complete.")
else:
    print("Loading pre-cleaned data...")
    inputs = pd.read_pickle(cleaned_inputs_path)
    labels = pd.read_pickle(labels_path)

print("\nFirst 5 cleaned reviews:")
print(inputs.head())

Loading pre-cleaned data...

First 5 cleaned reviews:
128841    [seem, bad, luck, play, charge, cable, work, s...
422950    [great, stick, box, everything, fine, no, dama...
403141                                    [perfect, thanks]
268147    [three, different, grip, ps, vita, favorite, h...
183205    [son, want, christmas, not, search, around, to...
Name: reviewText, dtype: object


## 9.2 Getting text ready for the model

Now that our text is clean, we need to convert it into numbers. Before we do, we must:
1.  **Split Data**: Create training, validation, and test sets. **Crucially**, we will create *balanced* validation and test sets to get a reliable accuracy score. The remaining data (which will be imbalanced) will be our training set.
2.  **Analyze Training Data**: We will analyze the *training data only* to find our `n_vocab` (vocabulary size) and decide on sequence lengths. Analyzing only the training data prevents **data leakage**.
3.  **Tokenize**: Use the Keras `Tokenizer` to build a vocabulary (word -> ID mapping) from the training data and then convert all three datasets into sequences of integers.

In [18]:
# 1. Split data (based on Listing 9.3)
def train_valid_test_split(inputs, labels, train_fraction=0.8):
    neg_indices = pd.Series(labels.loc[(labels == 0)].index)
    pos_indices = pd.Series(labels.loc[(labels == 1)].index)

    # Create balanced validation and test sets
    n_valid = int(min([len(neg_indices), len(pos_indices)]) * ((1 - train_fraction) / 2.0))
    n_test = n_valid

    neg_test_inds = neg_indices.sample(n=n_test, random_state=random_seed)
    neg_valid_inds = neg_indices.loc[~neg_indices.isin(neg_test_inds)].sample(n=n_test, random_state=random_seed)
    neg_train_inds = neg_indices.loc[~neg_indices.isin(neg_test_inds.tolist() + neg_valid_inds.tolist())]

    pos_test_inds = pos_indices.sample(n=n_test, random_state=random_seed)
    pos_valid_inds = pos_indices.loc[~pos_indices.isin(pos_test_inds)].sample(n=n_test, random_state=random_seed)
    pos_train_inds = pos_indices.loc[~pos_indices.isin(pos_test_inds.tolist() + pos_valid_inds.tolist())]

    tr_x = inputs.loc[neg_train_inds.tolist() + pos_train_inds.tolist()].sample(frac=1.0, random_state=random_seed)
    tr_y = labels.loc[neg_train_inds.tolist() + pos_train_inds.tolist()].sample(frac=1.0, random_state=random_seed)
    v_x = inputs.loc[neg_valid_inds.tolist() + pos_valid_inds.tolist()].sample(frac=1.0, random_state=random_seed)
    v_y = labels.loc[neg_valid_inds.tolist() + pos_valid_inds.tolist()].sample(frac=1.0, random_state=random_seed)
    ts_x = inputs.loc[neg_test_inds.tolist() + pos_test_inds.tolist()].sample(frac=1.0, random_state=random_seed)
    ts_y = labels.loc[neg_test_inds.tolist() + pos_test_inds.tolist()].sample(frac=1.0, random_state=random_seed)

    print(f"Training data: {len(tr_x)} (Imbalanced)")
    print(tr_y.value_counts())
    print(f"Validation data: {len(v_x)} (Balanced)")
    print(v_y.value_counts())
    print(f"Test data: {len(ts_x)} (Balanced)")
    print(ts_y.value_counts())

    return (tr_x, tr_y), (v_x, v_y), (ts_x, ts_y)

(tr_x, tr_y), (v_x, v_y), (ts_x, ts_y) = train_valid_test_split(inputs, labels, train_fraction=0.8)

# 2. Analyze Vocabulary
data_list = [w for doc in tr_x for w in doc]
cnt = Counter(data_list)
freq_df = pd.Series(list(cnt.values()), index=list(cnt.keys())).sort_values(ascending=False)

# We'll set our vocabulary to words that appear at least 25 times
n_vocab = (freq_df >= 25).sum()
print(f"\nVocabulary size (words appearing >= 25 times): {n_vocab}")

# 3. Analyze Sequence Length
seq_length_ser = tr_x.str.len()
print("\nSequence Length Statistics (90% of data):")
p_10 = seq_length_ser.quantile(0.1)
p_90 = seq_length_ser.quantile(0.9)
print(seq_length_ser[(seq_length_ser >= p_10) & (seq_length_ser < p_90)].describe(percentiles=[0.33, 0.66]))
# We will use [5, 15] as bucket boundaries based on the 33% and 66% percentiles.
bucket_boundaries = [5, 15]

Training data: 310388 (Imbalanced)
label
1    266155
0     44233
Name: count, dtype: int64
Validation data: 11058 (Balanced)
label
1    5529
0    5529
Name: count, dtype: int64
Test data: 11058 (Balanced)
label
1    5529
0    5529
Name: count, dtype: int64

Vocabulary size (words appearing >= 25 times): 11388

Sequence Length Statistics (90% of data):
count    278577.000000
mean         15.258334
std          16.047219
min           1.000000
33%           5.000000
50%          10.000000
66%          16.000000
max          73.000000
Name: reviewText, dtype: float64


In [19]:
from tensorflow.keras.preprocessing.text import Tokenizer

# 4. Tokenize the text
tokenizer = Tokenizer(
    num_words=n_vocab, # Use the vocab size we calculated
    oov_token='unk'    # Token for out-of-vocabulary words
)

# Fit the tokenizer on the training data ONLY
tokenizer.fit_on_texts(tr_x.tolist())

# Convert all three datasets to integer sequences
tr_x_seq = tokenizer.texts_to_sequences(tr_x.tolist())
v_x_seq = tokenizer.texts_to_sequences(v_x.tolist())
ts_x_seq = tokenizer.texts_to_sequences(ts_x.tolist())

print("\nExample of original text:")
print(tr_x.iloc[0])
print("\nExample of tokenized sequence:")
print(tr_x_seq[0])


Example of original text:
['fun', 'fun', 'fun']

Example of tokenized sequence:
[15, 15, 15]


---

## 9.3 Defining an end-to-end NLP pipeline with TensorFlow

We now have lists of variable-length integer sequences. To feed this to a model efficiently, we use `tf.data`.

Our pipeline will:
1.  Use `tf.ragged.constant` to handle the variable-length sequences.
2.  Use `tf.data.Dataset.from_tensor_slices` to create a dataset.
3.  Use `tf.data.experimental.bucket_by_sequence_length` to group sequences of similar lengths into batches. This is **much more efficient** than padding all sentences to the same maximum length, as it minimizes the amount of padding per batch.
4.  Split the data into `(inputs, labels)`.

In [20]:
# Define the pipeline function (based on Listing 9.4)
def get_tf_pipeline(text_seq, labels, batch_size=64, bucket_boundaries=[5, 15], max_length=50, shuffle=False):

    # 1. Combine labels and inputs (labels are prepended)
    data_seq = [[b] + a for a, b in zip(text_seq, labels.values)]

    # 2. Create a RaggedTensor and truncate to max_length
    tf_data = tf.ragged.constant(data_seq, dtype=tf.int32)[:, :max_length]

    # 3. Create the dataset
    text_ds = tf.data.Dataset.from_tensor_slices(tf_data)

    # 4. Filter empty sequences
    text_ds = text_ds.filter(lambda x: tf.size(x) > 1)

    # 5. Define the bucketing function
    bucket_fn = tf.data.experimental.bucket_by_sequence_length(
        element_length_func=lambda x: tf.cast(tf.shape(x)[0], 'int32'),
        bucket_boundaries=bucket_boundaries,
        bucket_batch_sizes=[batch_size] * (len(bucket_boundaries) + 1),
        padding_values=0,
        pad_to_bucket_boundary=False
    )

    # 6. Apply bucketing and shuffling
    text_ds = text_ds.map(lambda x: x).apply(bucket_fn)
    if shuffle:
        text_ds = text_ds.shuffle(buffer_size=10 * batch_size)

    # 7. Split back into (inputs, labels)
    # x[:, 0] is the label, x[:, 1:] is the input sequence
    text_ds = text_ds.map(lambda x: (x[:, 1:], x[:, 0]))

    return text_ds

# Create the train and validation datasets
batch_size = 128
train_ds = get_tf_pipeline(tr_x_seq, tr_y, batch_size=batch_size, shuffle=True)
valid_ds = get_tf_pipeline(v_x_seq, v_y, batch_size=batch_size)
test_ds = get_tf_pipeline(ts_x_seq, ts_y, batch_size=batch_size)

# Inspect a batch
for x_batch, y_batch in train_ds.take(1):
    print(f"X batch shape: {x_batch.shape}")
    print(f"Y batch shape: {y_batch.shape}")
    print(f"Example X: {x_batch[0, :10]}...")
    print(f"Example Y: {y_batch[0]}")

X batch shape: (128, 3)
Y batch shape: (128,)
Example X: [63  0  0]...
Example Y: 1


---

## 9.4 & 9.5: Sentiment Analysis Model (LSTM) & Training

We will now build our sentiment analysis model using an **LSTM (Long Short-Term Memory)** network.

**Why LSTM?** LSTMs are a type of RNN designed to overcome the vanishing gradient problem. They can learn long-range dependencies (e.g., remember the word "not" from the beginning of a long review) by using a series of "gates" (input, forget, output) to control a persistent cell state (its memory).

Our first model (Listing 9.5) will use:
1.  A `Masking` layer to ignore the `0` padding values.
2.  A custom `OnehotEncoder` layer to convert integer IDs to one-hot vectors.
3.  An `LSTM` layer.
4.  `Dense` layers for classification.

In [22]:
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, CSVLogger
from tensorflow.keras import backend as K_

# Custom OnehotEncoder layer (Listing 9.5)
class OnehotEncoder(tf.keras.layers.Layer):
    def __init__(self, depth, **kwargs):
        super(OnehotEncoder, self).__init__(**kwargs)
        self.depth = depth

    def call(self, inputs):
        inputs = tf.cast(inputs, 'int32')
        if len(inputs.shape) == 3:
            inputs = inputs[:, :, 0]
        return tf.one_hot(inputs, depth=self.depth)

    def compute_mask(self, inputs, mask=None):
        return mask

    def get_config(self):
        config = super().get_config().copy()
        config.update({'depth': self.depth})
        return config

K_.clear_session()

# 1. Define the model (Listing 9.5)
model_onehot = tf.keras.models.Sequential([
    # Note: The input_shape is (None, 1) - (sequence_length, features)
    # We use (None,) for variable sequence length, but the pipeline adds the feature dim.
    # For simplicity, we'll let the model infer from the first batch,
    # or specify input_shape=(None,) for the Embedding model later.
    # Here, we specify the input_shape required by the Masking layer
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(None, 1)),
    OnehotEncoder(depth=n_vocab + 1), # +1 for the 'unk' token
    tf.keras.layers.LSTM(128, return_state=False, return_sequences=False),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid') # Binary classification
])

# 2. Compile the model
model_onehot.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_onehot.summary()

# 3. Calculate class weight to handle imbalance
neg_weight = (tr_y == 1).sum() / (tr_y == 0).sum()
print(f"\nNegative class weight: {neg_weight:.2f}")

# 4. Define callbacks
os.makedirs('eval', exist_ok=True)
csv_logger = CSVLogger(os.path.join('eval', '1_sentiment_analysis_onehot.log'))
es_callback = EarlyStopping(monitor='val_loss', patience=6, mode='min')
lr_callback = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, mode='min')

# 5. Train the model (Listing 9.6)
print("\nTraining model with One-Hot Encoder...")
history_onehot = model_onehot.fit(
    train_ds,
    validation_data=valid_ds,
    epochs=2, # Book runs for 10, we'll use 2 for speed
    class_weight={0: neg_weight, 1: 1.0},
    callbacks=[es_callback, lr_callback, csv_logger]
)

# 6. Evaluate
print("\nEvaluating One-Hot model on test set...")
model_onehot.evaluate(test_ds)

  super().__init__(**kwargs)



Negative class weight: 6.02

Training model with One-Hot Encoder...
Epoch 1/2
   2423/Unknown [1m4794s[0m 2s/step - accuracy: 0.7844 - loss: 0.8036



[1m2423/2423[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4896s[0m 2s/step - accuracy: 0.7845 - loss: 0.8036 - val_accuracy: 0.8171 - val_loss: 0.3986 - learning_rate: 0.0010
Epoch 2/2
[1m2423/2423[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4897s[0m 2s/step - accuracy: 0.8473 - loss: 0.5974 - val_accuracy: 0.8342 - val_loss: 0.3779 - learning_rate: 0.0010

Evaluating One-Hot model on test set...
[1m88/88[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m97s[0m 1s/step - accuracy: 0.8273 - loss: 0.3859


[0.37712302803993225, 0.8323673009872437]

---

## 9.6 Injecting Semantics with Word Embeddings

One-hot encoding is simple but has two major flaws:
1.  **Inefficient**: The vectors are huge (size `n_vocab`) and sparse (mostly zeros).
2.  **Lacks Semantics**: The vectors for "good" and "great" are just as different as the vectors for "good" and "terrible". The model has to learn all semantic relationships from scratch.

**Word Embeddings** solve this. An `Embedding` layer is a lookup table that maps each integer ID to a dense, low-dimensional vector (e.g., 128 dimensions). These vectors are *trainable*, so the model learns to place words with similar meanings close together in the vector space.

In [23]:
K_.clear_session()

# 1. Define the model with an Embedding layer (Listing 9.7)
model_embed = tf.keras.models.Sequential([
    # The Embedding layer takes integer IDs.
    # input_dim = vocab size + 1 (for 0 padding)
    # output_dim = size of the dense vector (e.g., 128)
    # mask_zero=True automatically handles the padding (0) values.
    tf.keras.layers.Embedding(input_dim=n_vocab + 1,
                              output_dim=128,
                              mask_zero=True,
                              input_shape=(None,)), # (batch_size, sequence_length)

    tf.keras.layers.LSTM(128, return_state=False, return_sequences=False),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# 2. Compile the model
model_embed.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_embed.summary()

# 3. Define callbacks
csv_logger_embed = CSVLogger(os.path.join('eval', '2_sentiment_analysis_embed.log'))
es_callback_embed = EarlyStopping(monitor='val_loss', patience=6, mode='min')
lr_callback_embed = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=3, mode='min')

# 4. Train the model
print("\nTraining model with Embedding Layer...")
history_embed = model_embed.fit(
    train_ds,
    validation_data=valid_ds,
    epochs=2, # Book runs for 10
    class_weight={0: neg_weight, 1: 1.0},
    callbacks=[es_callback_embed, lr_callback_embed, csv_logger_embed]
)

# 5. Evaluate
print("\nEvaluating Embedding model on test set...")
model_embed.evaluate(test_ds)

# 6. Save the final model
os.makedirs('models', exist_ok=True)
model_embed.save(os.path.join('models', '1_sentiment_analysis_embed.h5'))

  super().__init__(**kwargs)



Training model with Embedding Layer...
Epoch 1/2
   2422/Unknown [1m326s[0m 129ms/step - accuracy: 0.7650 - loss: 0.7957



[1m2423/2423[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m333s[0m 132ms/step - accuracy: 0.7650 - loss: 0.7956 - val_accuracy: 0.8318 - val_loss: 0.3820 - learning_rate: 0.0010
Epoch 2/2
[1m2423/2423[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m324s[0m 130ms/step - accuracy: 0.8480 - loss: 0.5933 - val_accuracy: 0.8396 - val_loss: 0.3718 - learning_rate: 0.0010

Evaluating Embedding model on test set...
[1m88/88[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 64ms/step - accuracy: 0.8330 - loss: 0.3796




In [24]:
# 7. Inspect predictions
print("\n--- Inspecting Model Predictions ---")
test_x_list = ts_x.tolist()
test_y_list = ts_y.tolist()

# Get all predictions from the test set
test_ds_unbatched = get_tf_pipeline(ts_x_seq, ts_y, batch_size=128)
test_pred = model_embed.predict(test_ds_unbatched)

# Get indices of the sorted predictions
sorted_pred_idx = np.argsort(test_pred.flatten())

# Get top 5 most negative reviews
min_pred_idx = sorted_pred_idx[:5]
print("\nMost Negative Reviews (as predicted by model):")
print("="*50)
for i in min_pred_idx:
    print(f"Predicted: {test_pred[i][0]:.3f} | Actual: {test_y_list[i]}")
    print(" ".join(test_x_list[i]), "\n")

# Get top 5 most positive reviews
max_pred_idx = sorted_pred_idx[-5:]
print("\nMost Positive Reviews (as predicted by model):")
print("="*50)
for i in max_pred_idx:
    print(f"Predicted: {test_pred[i][0]:.3f} | Actual: {test_y_list[i]}")
    print(" ".join(test_x_list[i]), "\n")


--- Inspecting Model Predictions ---
[1m88/88[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 49ms/step

Most Negative Reviews (as predicted by model):
Predicted: 0.000 | Actual: 0
product not us region natively asian version pch- though normally play no region lock us account issue device deliver not us ver asian ver 

Predicted: 0.000 | Actual: 0
game `` look '' amaze hardware gun play seriously `` meh '' game play `` meh '' 

Predicted: 0.000 | Actual: 0
virus virus virus please careful completely shut pc buy new one 

Predicted: 0.000 | Actual: 0
boring 

Predicted: 0.001 | Actual: 0
's good 


Most Positive Reviews (as predicted by model):
Predicted: 1.000 | Actual: 0
not like expect something like pacman come playstation disappoint not make pacman/sonic game like use ... 

Predicted: 1.000 | Actual: 1
game new describe accept get platinum hit picture show original kinda suck still good deal 

Predicted: 1.000 | Actual: 1
work good playing could not seem keep attention not

# Task
The `punkt_tab` NLTK resource needs to be downloaded to resolve the `LookupError`. The current code only attempts to download it if the NLTK data directory does not exist, which can lead to issues if the directory exists but the resource is missing.

To fix this, I will modify the code cell `xJOlnyuHwdoV` to ensure that `nltk.download('punkt_tab', download_dir='nltk')` (along with other necessary NLTK downloads) is always attempted, relying on NLTK's internal checks to skip re-downloading already present resources. I will also ensure the NLTK data path is appended before the downloads.

```python
# Create NLTK directory if it doesn't exist
nltk_dir = os.path.abspath('nltk')
os.makedirs(nltk_dir, exist_ok=True) # Ensure directory exists

# Append to NLTK data path so NLTK knows where to look for downloaded data
nltk.data.path.append(nltk_dir)

# Download NLTK resources needed for tokenizing, PoS tagging, and lemmatizing
# NLTK's download function intelligently skips already downloaded resources.
nltk.download('averaged_perceptron_tagger', download_dir='nltk')
nltk.download('wordnet', download_dir='nltk')
nltk.download('omw-1.4', download_dir='nltk')
nltk.download('stopwords', download_dir='nltk')
nltk.download('punkt', download_dir='nltk')
nltk.download('punkt_tab', download_dir='nltk')
```

## modify_cells

### Subtask:
Modify the code cell to include the download of the missing 'punkt_tab' NLTK resource.


## Summary:

### Data Analysis Key Findings
*   A `LookupError` for the `punkt_tab` NLTK resource was identified as the core problem. This error occurred because the original code only attempted to download NLTK resources if the NLTK data directory did not exist, leading to issues if the directory was present but the specific resource was missing.
*   The solution involved modifying the code to unconditionally attempt the download of `punkt_tab` and other necessary NLTK resources (`averaged_perceptron_tagger`, `wordnet`, `omw-1.4`, `stopwords`, `punkt`).
*   The `nltk.data.path` was appended with the specified NLTK download directory (`nltk`) to ensure NLTK correctly locates the downloaded resources.
*   The `os.makedirs(nltk_dir, exist_ok=True)` function was used to create the NLTK directory if it didn't already exist, ensuring a valid path for downloads.

### Insights or Next Steps
*   By ensuring NLTK downloads are always attempted and the data path is correctly configured, the system becomes more robust against missing NLTK resources, regardless of the initial directory state.
*   The solution leverages NLTK's built-in intelligence to skip re-downloading already present resources, making the process efficient despite the unconditional download attempts.
