# **COMP9727 Recommender Systems**
## Tutorial Week 5: Sentiment Analysis

@Author: **Mingqin Yu**

@Reviewer: **Wayne Wobcke**

### Objective

The aim of the tutorial is to study the use of a neural network architecture, in particular Word2Vec with a Multi-Layer Perceptron, for sentiment analysis, a widely used technology underpinning recommender systems.

### Before the tutorial

1. Review the lecture material on Word2Vec, Multi-Layer Perceptrons and backpropagation.

2. Read, understand and run all the code in the Neural Network pipeline below, and come prepared to discuss the answers to some of the questions.

## Neural Network Pipeline

1. **Environment Setup**
   - Install libraries and tools
   - Load and decode the IMDb dataset
   - Tokenize the reviews
2. **Word2Vec Word Embeddings**
   - Train the Word2Vec model
   - Visualize word embeddings with t-SNE
3. **Multi-Layer Perceptron Architecture**
   - Preprocessing for neural network
   - Define neural network architecture
   - Train the model and visualize performance

**Specific Learning Objectives**

1. **Understand**
   - How to build a neural network model for word embeddings
   - A possible application of neural networks for sentiment analysis
2. **Apply**
   - Use a deep learning package for defining layered neural network architectures
   - Use word embeddings as an input layer in a neural network model
   - Use a multi-layer perceptron for classification
3. **Analyse**
   - Interpret training and validation set accuracy
4. **Evaluate**
   - The effectiveness of the sentiment classifier
   - Visualize and interpret the word embeddings
   - Visualize the results at each stage of the tutorial for deeper comprehension

## Neural Network Pipeline

### 1. Environment Setup

__Step 1. Install libraries and packages__

The following libraries and datasets are used:
- `Word2Vec` from `gensim` is used to generate word embeddings
- `imdb` dataset from `keras`, which is a set of movie reviews labeled as positive or negative
- `pad_sequences` is a utility function from `keras` used to ensure input sequences have the same length

In [None]:
from gensim.models import Word2Vec
from keras.datasets import imdb
from keras.utils import pad_sequences

__Step 2. Load and Decode the IMDb Dataset__

In this section, you are working with the IMDb movie reviews dataset

**Load the IMDb dataset**
- **`imdb.load_data(num_words=10000)`**
    - This function fetches the IMDb dataset from `keras`
    - The `num_words=10000` parameter keeps only the 10,000 most frequent words in the dataset and replaces rarer words with the "unknown" index, which reduces the vocabulary size and focuses the model on the most common words
- **`imdb.get_word_index()`**
    - This function retrieves the dictionary mapping of words to their corresponding index in the dataset
  
**Decode the reviews**
- Movie reviews in the dataset are represented as sequences of integers. To understand and work with these reviews, they need to be converted to a textual format
- `word_index` maps words to their integer indices, so the code inverts it into `reverse_word_index` to map integers back to words. Note that the indices in the loaded reviews are offset by 3 because 0, 1, and 2 are reserved indices for "padding", "start of sequence" and "unknown"

**Tokenize the reviews**
- The reviews are broken down into individual words, which is essential for the Word2Vec model
- `review.split()` breaks the review string into a list of words

In [None]:
# Load keras IMDb dataset
(X_train_raw, y_train), (X_test_raw, y_test) = imdb.load_data(num_words=10000)

# Decode reviews
word_index = imdb.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
decoded_reviews = [' '.join([reverse_word_index.get(i - 3, '?') for i in review]) for review in X_train_raw]

# Tokenize reviews
tokenized_reviews = [review.split() for review in decoded_reviews]
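The index offset can be checked on a toy vocabulary. The sketch below uses a made-up three-word `toy_word_index` (not the real IMDb mapping) and a hypothetical helper `decode_review`. It mirrors the `i - 3` lookup used above, so the reserved indices decode to `'?'`.

```python
# Toy stand-in for imdb.get_word_index(); the real mapping is far larger
toy_word_index = {"the": 1, "film": 2, "great": 3}

# Invert the mapping so that integers can be looked up
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

def decode_review(encoded, reverse_index, offset=3):
    """Map encoded integers back to words, undoing the index offset."""
    return " ".join(reverse_index.get(i - offset, "?") for i in encoded)

# 1 marks the start of the sequence and 2 marks an unknown word
encoded_review = [1, 4, 5, 2, 6]
decoded = decode_review(encoded_review, toy_reverse_index)
print(decoded)  # ? the film ? great
```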

### 2. Word2Vec Word Embeddings

Word embeddings represent words in a dense vector format where semantically similar words are closer in the vector space
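Closeness in the vector space is commonly measured with cosine similarity. The following sketch uses hand-picked 3-dimensional toy vectors (an assumption for illustration, not learned embeddings) and a hypothetical helper `cosine_similarity`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors: 'good' and 'great' point in similar directions, 'awful' does not
toy_vectors = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "awful": [-0.7, 0.9, 0.1],
}

sim_good_great = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_good_awful = cosine_similarity(toy_vectors["good"], toy_vectors["awful"])
print(sim_good_great > sim_good_awful)  # True
```

With a trained `gensim` model, `model_w2v.wv.similarity(word1, word2)` returns the same kind of score for learned vectors.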

__Step 1. Train the Word2Vec model__

- **`Word2Vec()`**
  - `sentences=tokenized_reviews`: Pass the tokenized movie reviews to the model
  - `vector_size=100`: Each word will be represented as a 100-dimensional vector
  - `window=5`: The maximum distance between the current word and a context word, so the model considers up to 5 words on each side of the current word to learn the representations
  - `min_count=1`: Words that appear only once in the entire dataset will also be considered
  - `workers=4`: Uses 4 worker threads to speed up model training on multi-core machines

In [None]:
# Train Word2Vec model
model_w2v = Word2Vec(sentences=tokenized_reviews, vector_size=100, window=5, min_count=1, workers=4)
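To see what `window` controls, the sketch below enumerates the (target, context) pairs produced by a fixed symmetric window. The helper `context_pairs` is hypothetical and is a simplification of what `gensim` does during training, where the effective window may also be shrunk randomly and frequent words may be downsampled.

```python
def context_pairs(tokens, window):
    """Return (target, context) pairs for a fixed symmetric window."""
    pairs = []
    for pos, target in enumerate(tokens):
        start = max(0, pos - window)
        end = min(len(tokens), pos + window + 1)
        for ctx_pos in range(start, end):
            if ctx_pos != pos:  # a word is not its own context
                pairs.append((target, tokens[ctx_pos]))
    return pairs

tokens = ["this", "film", "was", "really", "great"]
pairs_w1 = context_pairs(tokens, window=1)
pairs_w2 = context_pairs(tokens, window=2)

print(len(pairs_w1), len(pairs_w2))  # 8 14
```

A larger window yields more pairs per word and links words that are further apart, which tends to capture broader topical similarity rather than close syntactic similarity.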

__Step 2. Visualize Word Embeddings__

t-SNE (t-Distributed Stochastic Neighbor Embedding) is a popular algorithm to visualize high-dimensional data in 2D or 3D

- **t-SNE**:
  - First, select a sample of words
  - Convert their embeddings into 2D coordinates using t-SNE
  - Plot these coordinates to see how words relate to each other

In [None]:
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

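# Select sample words, keeping only those in the Word2Vec vocabulary
# so that words and vectors stay aligned when plotting
sample_words = ['film', 'actor', 'cast', 'great', 'good', 'bad', 'awful', 'terrible', 'boring']
sample_words = [word for word in sample_words if word in model_w2v.wv.key_to_index]
sample_vectors = [model_w2v.wv[word] for word in sample_words]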

# Convert list of vectors to a 2D array
sample_vectors = np.array(sample_vectors)

# Set perplexity to less than the number of sample_vectors
perplexity_value = len(sample_vectors) - 1

# Use t-SNE with adjusted perplexity
tsne = TSNE(n_components=2, perplexity=perplexity_value, random_state=0)
coordinates = tsne.fit_transform(sample_vectors)

# Plot
plt.figure(figsize=(8, 8))
for i, word in enumerate(sample_words[:len(sample_vectors)]):  # Adjust the loop to the number of vectors
    plt.scatter(coordinates[i, 0], coordinates[i, 1])
    plt.annotate(word, xy=(coordinates[i, 0], coordinates[i, 1]), xytext=(5, 2),
                 textcoords='offset points', ha='right', va='bottom')
plt.show()

### 3. Multi-Layer Perceptron Architecture 

__Step 1. Preprocessing for Neural Network__

Neural networks require inputs to be of consistent size. Thus, movie reviews need to be of the same length.
- **`pad_sequences`**
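  - This function ensures that all sequences in a list have the same length, either by padding them with a value (default is 0) or truncating them. By default, both padding and truncation are applied at the start of the sequence
  - `maxlen=500`: Reviews longer than 500 words will be truncated to their last 500 words, and shorter ones will be padded with leading zeros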
- **Embedding matrix**
  - An embedding matrix is a weight matrix that will be loaded into the `keras` embedding layer using the vectors learned by the Word2Vec model

In [None]:
# Define maximum sequence length
maxlen = 500
X_train = pad_sequences(X_train_raw, maxlen=maxlen)
X_test = pad_sequences(X_test_raw, maxlen=maxlen)

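# Prepare embedding matrix (row index = token index as it appears in X_train)
embedding_matrix = np.zeros((10000, 100))
for word, i in word_index.items():
    token_id = i + 3  # load_data() offsets word indices by 3 (0, 1, 2 are reserved)
    if token_id < 10000 and word in model_w2v.wv.key_to_index:
        embedding_matrix[token_id] = model_w2v.wv[word]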
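The default behaviour of `pad_sequences` is to pad and truncate at the start of each sequence (`padding='pre'`, `truncating='pre'`). The pure-Python sketch below imitates that default with a hypothetical helper `pad_pre`. It is an illustration only, not the `keras` implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last maxlen items of long ones."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                            # truncate from the front
        padded.append([value] * (maxlen - len(seq)) + seq)   # pad at the front
    return padded

toy_sequences = [[5, 8, 2], [7, 1, 9, 4, 3, 6]]
padded = pad_pre(toy_sequences, maxlen=4)
print(padded)  # [[0, 5, 8, 2], [9, 4, 3, 6]]
```

Note that with these defaults the end of a long review is kept and its beginning is discarded.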

__Step 2. Define Neural Network Architecture__

Construct a neural network in `keras` to predict the sentiment of movie reviews

- **Model layers**
  - `Embedding layer`: Convert word indices into dense vectors of fixed size, here, vectors of size 100
  - `Flatten layer`: Flattens the input. For instance, if the input is (batch_size, maxlen, 100), the output would be (batch_size, maxlen*100)
  - `Dense layers`: Fully connected neural network layers
  - `Dropout`: Randomly sets a fraction (here 0.5 or 50%) of input units to 0 at each update during training to prevent overfitting
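The effect of dropout during training can be sketched in plain Python. The helper `inverted_dropout` below is hypothetical and assumes the common "inverted" formulation, in which surviving units are scaled by `1 / (1 - rate)` so that the expected sum of the inputs is unchanged. At inference time dropout is switched off and values pass through unmodified.

```python
import random

def inverted_dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale survivors by 1 / (1 - rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0] * 10
dropped = inverted_dropout(activations, rate=0.5, rng=rng)
print(dropped)  # each entry is either 0.0 (dropped) or 2.0 (kept and rescaled)
```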

In [None]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, Dropout

model = Sequential()
model.add(Embedding(10000, 100, input_length=maxlen, weights=[embedding_matrix], trainable=False))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Display model summary
model.summary()
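The parameter counts reported by `model.summary()` can be checked by hand. The arithmetic below assumes the layer sizes used above (vocabulary 10,000, embedding size 100, `maxlen` 500, 128 hidden units); the variable names are introduced only for this illustration.

```python
vocab_size, embed_dim, maxlen = 10000, 100, 500
hidden_units = 128

embedding_params = vocab_size * embed_dim        # frozen, since trainable=False
flatten_size = maxlen * embed_dim                # Flatten has no weights
dense_hidden_params = flatten_size * hidden_units + hidden_units  # weights + biases
dense_output_params = hidden_units * 1 + 1       # weights + bias of the sigmoid unit

trainable = dense_hidden_params + dense_output_params
total = trainable + embedding_params
print(embedding_params, dense_hidden_params, dense_output_params)  # 1000000 6400128 129
print(trainable, total)  # 6400257 7400257
```

Almost all trainable parameters sit in the first dense layer, because flattening a 500 x 100 input produces 50,000 features per review.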

__Step 3. Train the Model and Visualize Performance__

Visualize the model's performance over time
- **Plot**
  - Using the history of the model training, plot the accuracy for both training and validation data across epochs. This visualization helps determine if the model is overfitting (when training accuracy increases but validation accuracy starts decreasing).

In [None]:
history = model.fit(X_train, y_train, epochs=5, batch_size=32, validation_data=(X_test, y_test))

# Visualization of training and validation set accuracy
plt.figure(figsize=(12, 6))
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()
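Alongside the plot, the same history can be summarized numerically. The helper `summarize_history` below is hypothetical, and the numbers in `example_history` are made up for illustration; they are not results from this tutorial. In the notebook, `history.history` would be passed in instead.

```python
def summarize_history(history_dict):
    """Return the best validation epoch (1-based), its accuracy, and the final train/val gap."""
    val_acc = history_dict["val_accuracy"]
    train_acc = history_dict["accuracy"]
    best_idx = max(range(len(val_acc)), key=lambda i: val_acc[i])
    final_gap = train_acc[-1] - val_acc[-1]
    return best_idx + 1, val_acc[best_idx], final_gap

# Made-up numbers for illustration only
example_history = {
    "accuracy": [0.70, 0.85, 0.93, 0.97, 0.99],
    "val_accuracy": [0.78, 0.82, 0.83, 0.82, 0.81],
}
best_epoch, best_val, gap = summarize_history(example_history)
print(best_epoch, best_val, round(gap, 2))  # 3 0.83 0.18
```

A validation accuracy that peaks early while the training accuracy keeps rising, leaving a large final gap, is the overfitting pattern described above.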

## Discussion Questions

1. **Parameter Tuning**
    - How do different values of the `window` parameter in Word2Vec impact the quality of embeddings?
    - Briefly describe what the Adam optimizer does
2. **Methods**
    - Find out which version of Word2Vec is used by `gensim` and whether another variety of Word2Vec is better
    - Experiment with pretrained Word2Vec models such as Google News to see if results improve
    - Apart from Word2Vec, what other sentiment analysis techniques can you apply in this context? How would they compare?
3. **Model Evaluation**
    - What do the visualizations tell you about the effectiveness of the models, for both the word embeddings and the sentiment classifier?
    - Experiment with some domain-specific sentiment words to see how the t-SNE plot for the word embeddings changes
    - Is the data of sufficient quantity and quality for the method to learn a good model?
    - How could you further preprocess data to enhance the quality of embeddings?
    - What other layers or activation functions could you use in the neural network to improve accuracy?