# Chapter 9: Natural Language Processing with TensorFlow: Sentiment Analysis

## 1️⃣ Chapter Overview

In the previous chapters, we focused on Computer Vision. Now, we switch gears to **Natural Language Processing (NLP)**. NLP is a field of AI focused on enabling computers to understand, interpret, and generate human language. 

In this chapter, we will build a **Sentiment Analysis** model. The goal is to classify Amazon video game reviews as either **Positive** or **Negative** based on the text content. We will move beyond simple dense networks and introduce **Recurrent Neural Networks (RNNs)**, specifically the **Long Short-Term Memory (LSTM)** network, which is designed to handle sequential data like text.

### Key Machine Learning Concepts:
* **Text Preprocessing:** Tokenization, Lemmatization, and Stop-word removal.
* **Text Representation:** One-Hot Encoding vs. Word Embeddings.
* **Sequential Models:** Understanding RNNs and LSTMs.
* **Handling Class Imbalance:** Using class weights during training.

### Practical Skills:
* Building efficient text data pipelines using `tf.data` and `RaggedTensors`.
* Implementing custom Keras layers for text vectorization.
* Training LSTM models for sequence classification.
* Using `tf.keras.layers.Embedding` to learn dense word representations.

## 2️⃣ Theoretical Explanation

### 2.1 NLP Preprocessing
Raw text is messy. Before a model can learn from it, we must clean it. Common steps include:
1.  **Tokenization:** Breaking a sentence into individual words (tokens).
2.  **Stop Word Removal:** Removing common words that add little meaning (e.g., "the", "is", "at"). *Note: In sentiment analysis, words like "not" are crucial and should NOT be removed.*
3.  **Lemmatization:** Converting words to their dictionary base form, or lemma (e.g., "running" $\rightarrow$ "run", "better" $\rightarrow$ "good"). This reduces the vocabulary size.
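
The three steps above can be sketched without any external library. This is a toy illustration only: `TOY_STOP_WORDS`, `TOY_LEMMAS`, and `toy_preprocess` are hypothetical names, and the tiny lookup tables stand in for NLTK's stop-word list and WordNet lemmatizer used later in the chapter.

```python
import re

# Hypothetical toy resources, standing in for NLTK's stop-word list and WordNet.
TOY_STOP_WORDS = {"the", "is", "at", "a", "was"}   # "not" is deliberately absent
TOY_LEMMAS = {"running": "run", "games": "game", "better": "good"}

def toy_preprocess(sentence):
    # 1. Tokenization: lowercase and pull out runs of letters.
    tokens = re.findall(r"[a-z]+", sentence.lower())
    # 2. Stop-word removal: drop common low-information words.
    tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    # 3. Lemmatization: map each token to its base form when it is known.
    return [TOY_LEMMAS.get(t, t) for t in tokens]

print(toy_preprocess("The game is not running better at all"))
# ['game', 'not', 'run', 'good', 'all']
```

Because "not" survives the stop-word filter, the negation is still visible to the model.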

### 2.2 Representing Text as Numbers
Models process numbers, not strings. We need to convert tokens into numerical vectors.

#### One-Hot Encoding
Each word is represented by a vector of size $V$ (vocabulary size). It contains a single `1` at the index of the word and `0`s elsewhere.
* *Pros:* Simple.
* *Cons:* High dimensionality, sparse, does not capture semantic similarity (e.g., "cat" and "dog" are as different as "cat" and "car").
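
A minimal sketch of the similarity problem, assuming a three-word toy vocabulary (the helper names are illustrative): every pair of distinct one-hot vectors is exactly the same distance apart, so the encoding cannot say that "cat" is closer to "dog" than to "car".

```python
import math

vocab = ["cat", "dog", "car"]          # toy vocabulary, V = 3
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's index.
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d_cat_dog = euclidean(one_hot("cat"), one_hot("dog"))
d_cat_car = euclidean(one_hot("cat"), one_hot("car"))
print(one_hot("cat"), d_cat_dog, d_cat_car)   # both distances are sqrt(2)
```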

#### Word Embeddings
Each word is represented by a dense vector of fixed size $D$ (e.g., 128). These vectors are learned during training.
* *Pros:* Lower dimensionality, and the learned vectors can capture semantic relationships (e.g., in well-trained embeddings, "king" - "man" + "woman" $\approx$ "queen").
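
To make "captures semantic meaning" concrete, here is a small sketch using cosine similarity. The three-dimensional vectors are hand-picked for illustration and are not learned values; in a real model they would come out of training.

```python
import math

# Hand-picked 3-dimensional vectors for illustration only; real embeddings are learned.
toy_embeddings = {
    "good":  [0.9, 0.8, 0.1],
    "great": [0.8, 0.9, 0.2],
    "awful": [-0.9, -0.7, 0.1],
}

def cosine(a, b):
    # Cosine similarity: 1 means same direction, -1 means opposite.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_good_great = cosine(toy_embeddings["good"], toy_embeddings["great"])
sim_good_awful = cosine(toy_embeddings["good"], toy_embeddings["awful"])
print(sim_good_great, sim_good_awful)   # high for the similar pair, negative for the opposite pair
```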

### 2.3 Long Short-Term Memory (LSTM)
Standard Feed-Forward networks cannot handle sequential data because they don't have "memory" of previous inputs. **Recurrent Neural Networks (RNNs)** solve this by maintaining a hidden state that passes information from one time step to the next.

However, standard RNNs suffer from the **Vanishing Gradient Problem**: gradients shrink as they are propagated back through many time steps, which makes long-term dependencies hard to learn.
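
The effect can be seen in a deliberately simplified setting. The sketch below assumes a scalar RNN $h_t = \tanh(w \cdot h_{t-1})$ with no input term and an arbitrary weight `w=0.9`; it multiplies the per-step derivatives to get the gradient of the final state with respect to the initial one.

```python
import math

def gradient_through_time(w, h0, steps):
    """d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d tanh(w * h_prev) / d h_prev = w * (1 - tanh^2)
        grad *= w * (1.0 - h * h)
    return grad

for steps in (5, 20, 50):
    print(steps, gradient_through_time(w=0.9, h0=0.5, steps=steps))
```

Each factor is below 1 in magnitude, so the product decays roughly exponentially with the number of steps.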

**LSTMs** improve on RNNs by introducing a **Cell State** ($C_t$) and three gates:
1.  **Forget Gate:** Decides what information to discard from the cell state.
2.  **Input Gate:** Decides what new information to store in the cell state.
3.  **Output Gate:** Decides what to output based on the cell state.

This allows LSTMs to remember important context over long sequences, making them ideal for processing reviews.
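
For reference, the gates can be written in the standard textbook form. The weight matrices $W_f, W_i, W_C, W_o$ and biases $b_f, b_i, b_C, b_o$ are notation introduced here, $[h_{t-1}, x_t]$ denotes concatenation, $\sigma$ is the sigmoid function, and $\odot$ is element-wise multiplication:

$$
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) && \text{(input gate)} \\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) && \text{(candidate values)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell state update)} \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(C_t) && \text{(hidden state)}
\end{aligned}
$$

The additive update of $C_t$ is what lets information, and gradients, flow across many time steps when $f_t$ stays close to 1.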

## 3️⃣ Data Preparation

We will use the **Amazon Video Games Review** dataset. We need to download it, clean the text using `NLTK`, and split it into training, validation, and test sets.

In [None]:
import os
import gzip
import shutil
import requests
import pandas as pd
import numpy as np
import tensorflow as tf
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# 1. Download NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')  # Needed by word_tokenize in recent NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')

# 2. Download Dataset
def download_data():
    os.makedirs('data', exist_ok=True)
    
    url = "http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Video_Games_5.json.gz"
    file_path = os.path.join('data', 'Video_Games_5.json.gz')
    json_path = os.path.join('data', 'Video_Games_5.json')
    
    if not os.path.exists(file_path):
        print("Downloading dataset...")
        # Stream the response to disk in chunks instead of loading it all into memory
        with requests.get(url, stream=True, timeout=60) as r:
            r.raise_for_status()  # Fail early on HTTP errors
            with open(file_path, 'wb') as f:
                for chunk in r.iter_content(chunk_size=1 << 20):
                    f.write(chunk)
            
    if not os.path.exists(json_path):
        print("Extracting dataset...")
        with gzip.open(file_path, 'rb') as f_in:
            with open(json_path, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
    return json_path

json_path = download_data()

# 3. Load Data
# We read the JSON file line by line
df = pd.read_json(json_path, lines=True)
df = df[['overall', 'verified', 'reviewText']]

# Filter for verified reviews and remove empty text
df = df[df['verified'] == True]
df = df.dropna(subset=['reviewText'])
df = df[df['reviewText'].apply(lambda x: len(str(x).strip()) > 0)]

# Create Binary Labels
# 4, 5 stars -> 1 (Positive)
# 1, 2, 3 stars -> 0 (Negative/Neutral)
df['label'] = df['overall'].apply(lambda x: 1 if x >= 4 else 0)

# Shuffle data
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)

print(f"Total reviews: {len(df)}")
print(df['label'].value_counts())

### 3.1 Text Cleaning
We define a function to clean the text. Note that we take 'not' and 'no' out of the stop-word set so that they stay in the text: dropping them would flip the sentiment of phrases like "not good".

In [None]:
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english')) - {'not', 'no'}

def clean_text(text):
    # Lowercase
    text = str(text).lower()
    
    # Remove digits
    text = re.sub(r'\d+', '', text)
    
    # Expand contractions (simple heuristic)
    text = text.replace("n't", " not")
    
    # Tokenize
    tokens = word_tokenize(text)
    
    # Remove stopwords and punctuation, and lemmatize
    cleaned_tokens = []
    for w in tokens:
        if w not in stop_words and w.isalnum():
            # Lemmatize as a verb (pos='v'); noun forms such as plurals are left unchanged
            lemma = lemmatizer.lemmatize(w, pos='v')
            cleaned_tokens.append(lemma)
            
    return " ".join(cleaned_tokens)

# Apply cleaning (Using a subset for speed in this demo, use full df for real training)
df_small = df.iloc[:20000].copy()
print("Cleaning text... (this may take a moment)")
df_small['clean_text'] = df_small['reviewText'].apply(clean_text)

print("Sample cleaned text:")
print(df_small[['reviewText', 'clean_text']].head())

### 3.2 Splitting Data
We split the data into Training (80%), Validation (10%), and Test (10%). We stratify both splits so that each set keeps the same class proportions as the full dataset; note that this preserves the imbalance rather than balancing the sets.

In [None]:
from sklearn.model_selection import train_test_split

X = df_small['clean_text'].values
y = df_small['label'].values

# First split: Train vs Temp
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Second split: Val vs Test
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

print(f"Train size: {len(X_train)}")
print(f"Val size: {len(X_val)}")
print(f"Test size: {len(X_test)}")
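
To see what stratification does, here is a minimal pure-Python sketch. `stratified_split` is a hypothetical helper, not scikit-learn's implementation, and the 80/20 label mix is made up for the example. Each class is split by the same fraction, so the held-out set mirrors the original class ratio.

```python
import random
from collections import Counter

def stratified_split(labels, test_fraction, seed=42):
    """Return (train_idx, test_idx) with each class split by the same fraction."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

toy_labels = [1] * 80 + [0] * 20          # hypothetical 80/20 imbalance
train_idx, test_idx = stratified_split(toy_labels, test_fraction=0.2)
print(Counter(toy_labels[i] for i in test_idx))   # 16 positives, 4 negatives
```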

### 3.3 Tokenization
We use Keras `Tokenizer` to convert text to sequences of integers.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Hyperparameters
VOCAB_SIZE = 10000
MAX_LEN = 100  # Defined for reference; the bucketing pipeline below does not truncate to it

tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token='<UNK>')
tokenizer.fit_on_texts(X_train)

# Convert text to sequences
train_seq = tokenizer.texts_to_sequences(X_train)
val_seq = tokenizer.texts_to_sequences(X_val)
test_seq = tokenizer.texts_to_sequences(X_test)

print(f"Example sequence: {train_seq[0]}")
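
The mapping the `Tokenizer` builds can be approximated in a few lines. This is a simplified sketch rather than the Keras implementation: `fit_word_index` and `to_sequence` are hypothetical helpers, and the input is assumed to be already cleaned and whitespace-separated. It keeps the same conventions: index 0 is left free for padding, index 1 is the OOV token, and more frequent words get smaller indices.

```python
from collections import Counter

def fit_word_index(texts, oov_token="<UNK>"):
    """Frequency-ranked word index: 0 is left for padding, 1 is the OOV token."""
    counts = Counter(word for text in texts for word in text.split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequence(text, word_index, num_words, oov_token="<UNK>"):
    oov_id = word_index[oov_token]
    ids = []
    for word in text.split():
        idx = word_index.get(word, oov_id)
        # Indices at or above num_words fall outside the kept vocabulary
        ids.append(idx if idx < num_words else oov_id)
    return ids

toy_texts = ["good game", "good story not good", "not fun"]
word_index = fit_word_index(toy_texts)
print(word_index)
print(to_sequence("good fun graphics", word_index, num_words=4))   # [2, 1, 1]
```

Here "fun" is in the index but ranked too low for `num_words=4`, and "graphics" was never seen, so both map to the OOV id.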

## 4️⃣ Building the Data Pipeline

We will use `tf.data` to create an efficient pipeline. A key optimization here is **Bucketing**. Instead of padding every sentence to the maximum length (which wastes memory), we group sentences of similar lengths together and pad them to the length of the longest sentence *in that bucket*.

In [None]:
def create_dataset(sequences, labels, batch_size=32, shuffle=True):
    # Create a RaggedTensor (handles variable lengths)
    ragged_data = tf.ragged.constant(sequences)
    data = tf.data.Dataset.from_tensor_slices((ragged_data, labels))

    if shuffle:
        data = data.shuffle(buffer_size=1000)

    # Bucketing logic
    # Boundaries [10, 25, 50] give four buckets: len < 10, 10-24, 25-49, and >= 50
    bucket_boundaries = [10, 25, 50]
    bucket_batch_sizes = [batch_size] * (len(bucket_boundaries) + 1)
    
    # Function to get sequence length
    length_func = lambda x, y: tf.shape(x)[0]
    
    # Transformation
    dataset = data.apply(
        tf.data.experimental.bucket_by_sequence_length(
            element_length_func=length_func,
            bucket_boundaries=bucket_boundaries,
            bucket_batch_sizes=bucket_batch_sizes,
            padded_shapes=(tf.TensorShape([None]), tf.TensorShape([])), # Pad x to variable len, y is scalar
            padding_values=(0, 0),
            drop_remainder=True  # Note: this also drops the final partial batch of each bucket
        )
    )
    
    return dataset.prefetch(tf.data.AUTOTUNE)

BATCH_SIZE = 64
train_ds = create_dataset(train_seq, y_train, BATCH_SIZE)
val_ds = create_dataset(val_seq, y_val, BATCH_SIZE, shuffle=False)
test_ds = create_dataset(test_seq, y_test, BATCH_SIZE, shuffle=False)
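
The saving from bucketing can be estimated with plain Python. The sketch below is an illustration, not the `tf.data` implementation: `padding_cost` is a hypothetical helper, the lengths are made up, and unlike the pipeline above it keeps partial batches. It counts how many pad tokens are needed when each batch is padded to its own longest sequence, with and without buckets.

```python
def padding_cost(lengths, batch_size, boundaries=None):
    """Count pad tokens when batches are padded to their own longest sequence.

    With boundaries=None all sequences share one bucket; otherwise a sequence
    of length n goes to the first bucket whose upper boundary exceeds n.
    """
    boundaries = list(boundaries or []) + [float("inf")]
    buckets = [[] for _ in boundaries]
    for n in lengths:
        bucket_id = next(i for i, b in enumerate(boundaries) if n < b)
        buckets[bucket_id].append(n)
    pad_tokens = 0
    for bucket in buckets:
        for start in range(0, len(bucket), batch_size):
            batch = bucket[start:start + batch_size]
            pad_tokens += sum(max(batch) - n for n in batch)
    return pad_tokens

toy_lengths = [3, 60, 5, 48, 12, 7, 55, 20]    # hypothetical review lengths
print(padding_cost(toy_lengths, batch_size=2))                            # 140
print(padding_cost(toy_lengths, batch_size=2, boundaries=[10, 25, 50]))   # 15
```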

## 5️⃣ Model A: One-Hot Encoding + LSTM

First, we build a baseline model using a custom **One-Hot Encoding** layer. 

**Architecture:**
Input $\rightarrow$ OneHot (with padding mask) $\rightarrow$ LSTM $\rightarrow$ Dense $\rightarrow$ Output

We need a **mask** because padded zeros should be ignored by the LSTM. Keras's `Masking` layer decides what to mask by reducing over the last axis, which suits `(batch, seq_len, features)` inputs; applied directly to our `(batch, seq_len)` token IDs it would collapse the time axis. The custom layer therefore builds the per-timestep mask itself in `compute_mask`.

In [None]:
class OnehotEncoder(tf.keras.layers.Layer):
    def __init__(self, depth, **kwargs):
        super().__init__(**kwargs)
        self.depth = depth
        # Tell Keras that this layer produces a mask for downstream layers
        self.supports_masking = True

    def call(self, inputs):
        # Inputs come in as (batch, seq_len)
        # Cast to int32
        x = tf.cast(inputs, tf.int32)
        # One-hot encode: (batch, seq_len, depth)
        return tf.one_hot(x, depth=self.depth)
    
    def compute_mask(self, inputs, mask=None):
        # Token ID 0 is padding; the mask has shape (batch, seq_len)
        return tf.not_equal(tf.cast(inputs, tf.int32), 0)

    def get_config(self):
        config = super().get_config().copy()
        config.update({'depth': self.depth})
        return config

def build_onehot_model(vocab_size):
    model = tf.keras.Sequential([
        # Variable-length sequences of integer token IDs
        tf.keras.Input(shape=(None,), dtype='int32'),
        
        # Custom One-Hot layer (also emits the padding mask)
        OnehotEncoder(depth=vocab_size),
        
        # LSTM Layer
        # return_sequences=False -> return only the final hidden state
        tf.keras.layers.LSTM(64, return_sequences=False),
        
        # Classifier
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    return model

model_onehot = build_onehot_model(VOCAB_SIZE)
model_onehot.summary()
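
To make the effect of masking concrete, here is a toy scalar RNN in plain Python. The weights `w_x` and `w_h` are arbitrary, and `run_rnn` is a hypothetical stand-in for what a Keras recurrent layer does with a mask: at masked steps the previous state is simply carried forward.

```python
import math

def run_rnn(sequence, mask, w_x=0.5, w_h=0.8):
    """Toy scalar RNN; where mask is False the previous state is carried over."""
    h = 0.0
    for x, keep in zip(sequence, mask):
        if keep:
            h = math.tanh(w_x * x + w_h * h)
        # else: padded step, leave h unchanged
    return h

real = [3, 1, 4]
padded = real + [0, 0]                       # padded with zeros to length 5
mask = [token != 0 for token in padded]      # True for real tokens

h_real = run_rnn(real, [True] * len(real))
h_masked = run_rnn(padded, mask)
h_unmasked = run_rnn(padded, [True] * len(padded))
print(h_real, h_masked, h_unmasked)
```

With the mask, the padded sequence ends in the same state as the unpadded one; without it, the trailing zeros keep changing the state.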

### 5.1 Handling Class Imbalance
Review datasets are often imbalanced (more positive reviews than negative). We compute **Class Weights** to force the model to pay more attention to the minority class.

In [None]:
neg_count = np.sum(y_train == 0)
pos_count = np.sum(y_train == 1)
total = len(y_train)

# Weight for class 0
weight_0 = (1 / neg_count) * (total / 2.0)
# Weight for class 1
weight_1 = (1 / pos_count) * (total / 2.0)

class_weights = {0: weight_0, 1: weight_1}
print(f"Class Weights: {class_weights}")

# Compile
model_onehot.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
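
A worked example of the weight formula, using hypothetical counts (200 negative, 800 positive) rather than the dataset's actual numbers. The helper `weighted_bce` is an illustrative name that shows how a class weight scales the per-example binary cross-entropy.

```python
import math

neg, pos = 200, 800                 # hypothetical counts, not the dataset's
total = neg + pos
w0 = (1 / neg) * (total / 2.0)      # 2.5 for these counts
w1 = (1 / pos) * (total / 2.0)      # 0.625 for these counts

def weighted_bce(y_true, p, weights):
    """Binary cross-entropy for one example, scaled by its class weight."""
    loss = -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))
    return weights[y_true] * loss

weights = {0: w0, 1: w1}
# Each class now contributes the same total weight (500 each here).
print(w0 * neg, w1 * pos)
# The same-sized mistake costs 4x more on the minority class.
print(weighted_bce(0, 0.7, weights), weighted_bce(1, 0.3, weights))
```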

### 5.2 Training Model A
*(Note: We train for fewer epochs here for demonstration purposes)*

In [None]:
history_onehot = model_onehot.fit(
    train_ds,
    validation_data=val_ds,
    epochs=3,
    class_weight=class_weights
)

## 6️⃣ Model B: Embeddings + LSTM

One-hot encoding creates huge, sparse vectors. **Embeddings** are usually a better choice because they are dense, low-dimensional, and learned during training. We replace the `OnehotEncoder` with `tf.keras.layers.Embedding`.

In [None]:
def build_embedding_model(vocab_size):
    model = tf.keras.Sequential([
        # Embedding Layer
        # mask_zero=True makes the layer emit a padding mask for index 0,
        # so no separate masking step is needed
        tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=128, mask_zero=True),
        
        # LSTM
        tf.keras.layers.LSTM(64),
        
        # Classifier
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    return model

model_emb = build_embedding_model(VOCAB_SIZE)
model_emb.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_emb.summary()

### 6.1 Training Model B

In [None]:
history_emb = model_emb.fit(
    train_ds,
    validation_data=val_ds,
    epochs=3,
    class_weight=class_weights
)

## 7️⃣ Step-by-Step Explanation

### 1. Data Pipeline Construction
* **Input:** We have variable-length reviews (lists of integers).
* **Process:** `tf.ragged.constant` stores the variable-length sequences without padding. `bucket_by_sequence_length` then groups sequences into length buckets (e.g., short, medium, long). A batch of short sequences is padded only to the length of its longest member, so less computation is spent on padding.
* **Output:** A `tf.data.Dataset` yielding batches of `(input, label)`.

### 2. LSTM Mechanism
* **Input:** A sequence of vectors (either one-hot or embeddings).
* **Process:** The LSTM loops through the sequence. At each step $t$, it looks at the current word $x_t$ and its previous memory (hidden state $h_{t-1}$ and cell state $C_{t-1}$). It decides what to remember and what to forget using gates.
* **Output:** We use `return_sequences=False`, so it outputs only the final state $h_T$ after seeing the whole review. This final vector summarizes the sentiment.
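
The loop described above can be sketched with a scalar LSTM cell in plain Python. This is a simplified illustration, not the Keras implementation: `lstm_step` and the `params` layout are hypothetical, real layers use weight matrices instead of scalars, and the parameter values below are arbitrary rather than trained.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate name -> (w_x, w_h, b)."""
    def pre(gate):
        w_x, w_h, b = p[gate]
        return w_x * x_t + w_h * h_prev + b
    f = sigmoid(pre("forget"))          # how much of c_prev to keep
    i = sigmoid(pre("input"))           # how much new information to write
    g = math.tanh(pre("candidate"))     # the candidate new information
    o = sigmoid(pre("output"))          # how much of the cell to expose
    c_t = f * c_prev + i * g
    h_t = o * math.tanh(c_t)
    return h_t, c_t

# Arbitrary illustrative parameters; a trained model learns these.
params = {gate: (0.5, 0.3, 0.0) for gate in ("forget", "input", "candidate", "output")}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 2.0]:              # a toy input sequence
    h, c = lstm_step(x, h, c, params)
print(h, c)                             # only this final state is kept, as with return_sequences=False
```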

### 3. Embedding vs One-Hot
* **One-Hot:** The vector size is 10,000 (vocab size). It is mostly zeros.
* **Embedding:** The vector size is 128. It is dense. The model learns that words like "good" and "great" should have similar vectors (be close in vector space), which helps generalization.
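
The size difference also shows up in the parameter counts. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights, and a bias) and the layer sizes used in this chapter; `lstm_params` is a hypothetical helper. The results should agree with the LSTM rows of `model.summary()`, but it is worth checking them against your own output.

```python
def lstm_params(input_dim, units):
    """Weights in a standard LSTM layer: 4 gates, each with input, recurrent, and bias terms."""
    return 4 * units * (input_dim + units + 1)

VOCAB_SIZE, EMBED_DIM, UNITS = 10000, 128, 64

onehot_lstm = lstm_params(VOCAB_SIZE, UNITS)       # LSTM fed 10,000-dim one-hot vectors
embedding_table = VOCAB_SIZE * EMBED_DIM           # one 128-dim row per vocabulary entry
embedding_lstm = lstm_params(EMBED_DIM, UNITS)     # LSTM fed 128-dim embeddings

print(onehot_lstm)                        # 2576640
print(embedding_table + embedding_lstm)   # 1329408
```

Even after adding the embedding table, Model B needs roughly half the parameters of Model A up to and including the LSTM layer.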

## 8️⃣ Chapter Summary

* **NLP Pipeline:** Cleaning text (NLTK) $\rightarrow$ Tokenizing (Keras) $\rightarrow$ Bucketing & Batching (`tf.data`).
* **Masking:** Crucial for variable-length sequences. It tells the model to ignore the zeros added for padding.
* **LSTM:** A powerful RNN variant that mitigates the vanishing gradient problem, allowing it to learn long-term dependencies in text.
* **Embeddings:** Learned dense representations of words are far more compact than One-Hot vectors and can place related words close together, which usually helps deep learning models generalize.
* **Imbalanced Data:** Class weights make errors on the minority class (Negative reviews) cost more, which discourages the model from simply predicting the majority class (Positive reviews).