<a href="https://colab.research.google.com/github/AnsiaNijas/ADS1/blob/main/sentiment_analysis_glove_bilstm_structured.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis with Pretrained GloVe Embeddings and LSTM on the IMDB Dataset
This tutorial demonstrates how to perform sentiment analysis on the IMDB dataset using pretrained GloVe embeddings and a Bidirectional LSTM model.

## 1. Introduction
Sentiment analysis is a core natural language processing (NLP) task that classifies the sentiment a piece of text expresses. In general the classes may be positive, negative, or neutral; the IMDB dataset used here is binary, with each review labeled positive or negative. This tutorial combines pretrained GloVe word embeddings with a Bidirectional LSTM to classify those reviews.

## 2. Setup
Import the required libraries and packages for this project.

## 3. Load Data
Load the IMDB dataset using TensorFlow Datasets.

## 4. Preprocess Data
Prepare the data for training by tokenizing text, creating sequences, and padding them to uniform lengths.

## 5. Build Model
Define the architecture of the sentiment analysis model using GloVe embeddings and Bidirectional LSTM.

## 6. Train and Evaluate
Train the model on the prepared data and evaluate its performance on the test set.

## 7. Export Model
Save the trained model for future inference.

## 8. Visualizations
Generate plots to visualize the training and validation accuracy and loss.

## 9. Conclusion
Summarize the results and discuss possible future improvements to the model.

## **Code Walkthrough**
### **Setup: Preparing the Environment**

This step imports the libraries the tutorial depends on. Each serves a specific purpose in the workflow:

1. **`numpy`**:
   - Used for numerical computations, such as creating and manipulating arrays.
   - Essential for working with the embedding matrix for GloVe.

2. **`tensorflow`**:
   - The primary framework used to build, train, and evaluate machine learning models.
   - Provides utilities for neural network layers, optimization algorithms, and training workflows.

3. **`tensorflow_datasets`**:
   - A module that simplifies dataset loading and preprocessing.
   - It will be used to import the IMDB dataset, which is pre-split into training and testing sets.

4. **`tensorflow.keras.preprocessing.text`**:
   - Provides tools to process textual data by tokenizing words into sequences of integers.

5. **`tensorflow.keras.preprocessing.sequence`**:
   - Used to pad or truncate sequences to ensure they have uniform length, making them suitable for input to the model.

6. **`os`**:
   - Provides operating-system utilities such as file-path handling.
   - Imported for working with the locally stored GloVe files, although the code below opens the embeddings file with Python's built-in `open`.

Together, these imports cover every later step of the tutorial: data preparation, model building, training, and evaluation.



In [None]:
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import os

## Load the IMDB Dataset

In [None]:
# Load IMDB dataset
(train_data, test_data), info = tfds.load(
    'imdb_reviews',
    split=['train', 'test'],
    as_supervised=True,
    with_info=True
)

## Preprocess the Data

In [None]:
# Extract sentences and labels
train_sentences, train_labels = [], []
test_sentences, test_labels = [], []

for sentence, label in train_data:
    train_sentences.append(sentence.numpy().decode('utf8'))
    train_labels.append(label.numpy())

for sentence, label in test_data:
    test_sentences.append(sentence.numpy().decode('utf8'))
    test_labels.append(label.numpy())

train_labels = np.array(train_labels)
test_labels = np.array(test_labels)

## Tokenize and Pad Sequences

In [None]:
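# Parameters
vocab_size = 20000      # keep only the most frequent words
embedding_dim = 100     # must match the GloVe file used (glove.6B.100d)
max_length = 100        # tokens per review after padding/truncation
trunc_type = 'post'     # drop tokens from the end of long reviews
padding_type = 'post'   # append zeros to the end of short reviews
oov_tok = "<OOV>"       # placeholder for out-of-vocabulary words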

# Tokenizer
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index

# Convert texts to sequences
train_sequences = tokenizer.texts_to_sequences(train_sentences)
test_sequences = tokenizer.texts_to_sequences(test_sentences)

# Pad sequences
train_padded = pad_sequences(train_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
test_padded = pad_sequences(test_sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)
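
To make the tokenize-and-pad step concrete, here is a minimal pure-Python sketch of the same idea. It is a simplified stand-in, not the Keras implementation: the helper names (`build_word_index`, `encode`, `pad_post`) are hypothetical, and the sketch splits on whitespace only, whereas `Tokenizer` also strips punctuation by default. It follows the Keras convention that index 0 is reserved for padding, index 1 goes to the OOV token, and words whose index is at or above `num_words` are mapped to OOV.

```python
from collections import Counter

PAD_ID = 0  # index reserved for padding
OOV_ID = 1  # index of the out-of-vocabulary token

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; the most frequent word gets index 2."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: OOV_ID}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def encode(text, word_index, num_words):
    """Map words to ids; unseen or too-rare words become OOV_ID."""
    ids = []
    for word in text.lower().split():
        idx = word_index.get(word, OOV_ID)
        ids.append(idx if idx < num_words else OOV_ID)
    return ids

def pad_post(ids, maxlen):
    """Truncate from the end, then append PAD_ID up to maxlen."""
    ids = ids[:maxlen]
    return ids + [PAD_ID] * (maxlen - len(ids))

toy_index = build_word_index(["the movie was good", "the movie was bad"])
toy_ids = encode("the film was good", toy_index, num_words=5)
print(toy_ids)               # [2, 1, 4, 1]
print(pad_post(toy_ids, 6))  # [2, 1, 4, 1, 0, 0]
print(pad_post(toy_ids, 3))  # [2, 1, 4]
```

In the toy run, "film" never appeared in the fitting texts and "good" ranks below the `num_words` cutoff, so both collapse to the OOV id.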

## Load Pretrained GloVe Embeddings

In [None]:
# Download GloVe embeddings
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip -q glove.6B.zip -d ./glove.6B/

# Load embeddings into a dictionary
embeddings_index = {}
with open('./glove.6B/glove.6B.100d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
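
Each line of a GloVe text file holds a token followed by its vector components, separated by spaces. The sketch below parses that format from a small in-memory sample so it can run without the download. The function name `parse_glove` and the three-dimensional toy vectors are illustrative assumptions, not part of the notebook.

```python
import io

import numpy as np

def parse_glove(lines):
    """Build a {word: vector} dictionary from GloVe-formatted text lines."""
    index = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        index[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return index

# Toy sample in the same layout as glove.6B.100d.txt, but with 3 dimensions
sample = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")
toy_embeddings = parse_glove(sample)
print(sorted(toy_embeddings))        # ['bad', 'good']
print(toy_embeddings["good"].shape)  # (3,)
```

The same function accepts an open file handle, since iterating over a file also yields one line at a time.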

## Create Embedding Matrix

In [None]:
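# Create embedding matrix: row i holds the GloVe vector for the word with index i.
# Words missing from GloVe, and the padding index 0, keep all-zero rows.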
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in word_index.items():
    if i < vocab_size:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
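
Since the matrix starts as zeros and is only filled for words found in GloVe, it can be worth checking how much of the vocabulary the pretrained vectors actually cover. The helper below is a hypothetical addition (`embedding_coverage` is not in the notebook), shown on toy dictionaries; with the real data one would pass `word_index`, `embeddings_index`, and `vocab_size`.

```python
def embedding_coverage(word_index, embeddings_index, vocab_size):
    """Return (words found in GloVe, words considered) for indices below vocab_size."""
    in_vocab = [word for word, i in word_index.items() if i < vocab_size]
    found = sum(1 for word in in_vocab if word in embeddings_index)
    return found, len(in_vocab)

# Toy stand-ins for the tokenizer vocabulary and the GloVe dictionary
toy_word_index = {"<OOV>": 1, "the": 2, "movie": 3, "zzyzx": 4, "rare": 5}
toy_glove = {"the": [0.1], "movie": [0.2], "rare": [0.3]}

found, total = embedding_coverage(toy_word_index, toy_glove, vocab_size=5)
print(f"{found}/{total} words covered")  # 2/4 words covered
```

Low coverage would suggest that many words are represented by zero vectors, which a frozen embedding layer cannot correct during training.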

## Build the Model

In [None]:
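# Build the model
model = tf.keras.Sequential([
    # Explicit input shape so model.summary() can report output shapes
    tf.keras.Input(shape=(max_length,)),
    # Frozen embedding initialized from the GloVe matrix. A Constant initializer avoids the
    # legacy `weights=` and `input_length=` arguments, which newer Keras releases may reject.
    tf.keras.layers.Embedding(
        vocab_size, embedding_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False
    ),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),  # 64 units per direction
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')  # probability of a positive review
])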

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Model summary
model.summary()
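
As a cross-check on the summary, the parameter counts can be worked out by hand. The sketch below assumes the standard Keras LSTM parameterization (four gates, each with input weights, recurrent weights, and one bias vector) and the default `Bidirectional` behavior of concatenating the two directions. The helper names are illustrative.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per unit."""
    return input_dim * units + units

vocab_size, embedding_dim, lstm_units = 20000, 100, 64

embedding = vocab_size * embedding_dim               # frozen, so non-trainable
bilstm = 2 * lstm_params(embedding_dim, lstm_units)  # forward + backward LSTM
hidden = dense_params(2 * lstm_units, 64)            # BiLSTM output is 128 wide
output = dense_params(64, 1)

trainable = bilstm + hidden + output
print(embedding)  # 2000000
print(bilstm)     # 84480
print(hidden)     # 8256
print(output)     # 65
print(trainable)  # 92801
```

Nearly all of the model's parameters sit in the frozen embedding layer, so only about 93 thousand weights are updated during training.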

## Train the Model

In [None]:
# Train the model
num_epochs = 10
history = model.fit(
    train_padded,
    train_labels,
    epochs=num_epochs,
    # Note: the test set doubles as validation data here, so it is not a fully held-out set
    validation_data=(test_padded, test_labels),
    verbose=2
)
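
The outline mentions plotting training and validation accuracy and loss. One possible sketch is shown below. It assumes `matplotlib` is available (it is imported lazily inside the plotting function) and that the history dictionary uses the keys `accuracy`, `val_accuracy`, `loss`, and `val_loss`, which is what `metrics=['accuracy']` with validation data produces. The helper names and the toy numbers are made up for illustration and are not results from this model.

```python
def best_epoch(history_dict, metric="val_accuracy"):
    """Return (1-based epoch, value) where the metric is highest."""
    values = history_dict[metric]
    best = max(range(len(values)), key=values.__getitem__)
    return best + 1, values[best]

def plot_history(history_dict):
    """Plot training vs. validation curves for accuracy and loss."""
    import matplotlib.pyplot as plt  # lazy import: only needed when plotting

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, metric in zip(axes, ("accuracy", "loss")):
        epochs = range(1, len(history_dict[metric]) + 1)
        ax.plot(epochs, history_dict[metric], label=f"training {metric}")
        ax.plot(epochs, history_dict[f"val_{metric}"], label=f"validation {metric}")
        ax.set_xlabel("epoch")
        ax.set_ylabel(metric)
        ax.legend()
    fig.tight_layout()
    return fig

# Toy history with invented values, in the same layout as history.history
toy_history = {
    "accuracy": [0.70, 0.80, 0.90],
    "val_accuracy": [0.72, 0.81, 0.79],
    "loss": [0.60, 0.45, 0.30],
    "val_loss": [0.58, 0.44, 0.50],
}
print(best_epoch(toy_history))  # (2, 0.81)
```

In the notebook, calling `plot_history(history.history)` after training would draw both panels. A validation curve that peaks and then declines while training accuracy keeps rising, as in the toy data, is the usual sign of overfitting.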

## Evaluate the Model

In [None]:
# Evaluate the model
loss, accuracy = model.evaluate(test_padded, test_labels)
print(f'Test Accuracy: {accuracy:.4f}')

## Save the Model

In [None]:
# Save the model
model.save('sentiment_analysis_glove_bilstm.h5')

## Load and Use the Model

In [None]:
# Load the model
loaded_model = tf.keras.models.load_model('sentiment_analysis_glove_bilstm.h5')

# Predict on new data, reusing the tokenizer fitted on the training set
sample_text = ["The movie was fantastic! I really enjoyed it."]
sample_seq = tokenizer.texts_to_sequences(sample_text)
sample_padded = pad_sequences(sample_seq, maxlen=max_length, padding=padding_type, truncating=trunc_type)
prediction = loaded_model.predict(sample_padded)
print(f'Prediction: {prediction[0][0]:.4f}')
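
The printed value is the sigmoid output, which can be read as the predicted probability of the positive class (label 1 in `imdb_reviews`; label 0 is negative). A small hypothetical helper for turning that score into a label might look like the following; the 0.5 threshold is a common default rather than something the notebook specifies.

```python
def label_from_score(score, threshold=0.5):
    """Map a sigmoid output in [0, 1] to a sentiment label."""
    return "positive" if score >= threshold else "negative"

# Made-up scores for illustration, not outputs of the trained model
for score in (0.93, 0.50, 0.12):
    print(f"{score:.2f} -> {label_from_score(score)}")
# 0.93 -> positive
# 0.50 -> positive
# 0.12 -> negative
```

With the notebook's variables, this would be called as `label_from_score(prediction[0][0])`.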