# IMDB movie review sentiment classification with BERT (aclImdb version)

In this notebook, we'll use a pretrained BERT model for sentiment classification using **TensorFlow** with the **Keras API**.

First, the needed imports.

In [None]:
%matplotlib inline

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.utils import plot_model

import keras_nlp

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

print('Using TensorFlow version: {}, and Keras version: {}.'.format(tf.__version__, tf.keras.__version__))

Let's check if we have a GPU available.

In [None]:
use_fp16 = False

gpus = tf.config.list_physical_devices('GPU')
if len(gpus) > 0:
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    from tensorflow.python.client import device_lib
    for d in device_lib.list_local_devices():
        if d.device_type == 'GPU':
            print('GPU', d.physical_device_desc)
    if use_fp16:
        keras.mixed_precision.set_global_policy("mixed_float16")
else:
    print('No GPU, using CPU instead.')

## IMDB data set

Next we'll load the IMDB data set. The first time, we may have to download the data, which can take a while.

The dataset contains 50,000 movie reviews from the Internet Movie Database, split into 25,000 reviews for training and 25,000 reviews for testing. Half of the reviews are positive (1) and half are negative (0).

The dataset consists of movie reviews stored as text files in a directory hierarchy, with one subdirectory per class. We use the `text_dataset_from_directory()` function to create a `tf.data.Dataset` from the text files.
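To make the expected directory layout concrete, here is a minimal sketch that builds a tiny stand-in hierarchy in a temporary directory and counts the text files per class. The helper `count_reviews` and the toy reviews are hypothetical and only for illustration. `text_dataset_from_directory()` assigns integer labels to the class subdirectories in alphanumeric order, so `neg` becomes 0 and `pos` becomes 1. Note that the raw aclImdb archive also ships an `unsup` folder of unlabeled reviews under `train`. If your copy still contains it, it would show up as a third class, so a count like this can be a useful sanity check.

```python
import tempfile
from pathlib import Path


def count_reviews(root):
    """Return {class_name: number of .txt files} for each subdirectory of root."""
    root = Path(root)
    return {
        d.name: sum(1 for _ in d.glob("*.txt"))
        for d in sorted(root.iterdir())
        if d.is_dir()
    }


# Build a tiny stand-in for aclImdb/train: one subdirectory per class.
toy_reviews = {"neg": ["Dull."], "pos": ["Great!", "Loved it."]}  # made-up data

with tempfile.TemporaryDirectory() as tmp:
    for label, texts in toy_reviews.items():
        class_dir = Path(tmp) / label
        class_dir.mkdir()
        for i, text in enumerate(texts):
            (class_dir / f"{i}.txt").write_text(text)
    demo_counts = count_reviews(tmp)

print(demo_counts)
```

On the real data, calling `count_reviews(DATADIR + "/aclImdb/train")` should report the same number of files for `neg` and `pos`.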

In [None]:
DATADIR = "/media/gpu-data/imdb"
BATCH_SIZE = 16
validation_split = 0.2
seed = 42

In [None]:
print('Train:')
imdb_train = tf.keras.utils.text_dataset_from_directory(
    DATADIR+"/aclImdb/train",
    batch_size=BATCH_SIZE,
    validation_split=validation_split,
    subset='training',
    seed=seed
)
print('\nValidation:')
imdb_valid = tf.keras.utils.text_dataset_from_directory(
    DATADIR+"/aclImdb/train",
    batch_size=BATCH_SIZE,
    validation_split=validation_split,
    subset='validation',
    seed=seed
)
print('\nTest:')
imdb_test = tf.keras.utils.text_dataset_from_directory(
    DATADIR+"/aclImdb/test",
    batch_size=BATCH_SIZE,
)
print('\nAn example review:')
print(imdb_train.unbatch().take(1).get_single_element())
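As a quick sanity check on the output of the loading cell, the split sizes can be worked out by hand. The sketch below assumes exactly 25,000 labeled reviews in each of the `train` and `test` directories, as stated in the dataset description, and that the last, partially filled batch is kept. The variable and function names are illustrative only.

```python
import math

n_train_dir = 25000       # labeled reviews in aclImdb/train (assumed)
n_test = 25000            # labeled reviews in aclImdb/test (assumed)
validation_split = 0.2    # same value as in the notebook
batch_size = 16           # same value as BATCH_SIZE in the notebook

# The validation subset takes the given fraction of the training directory.
n_valid = int(n_train_dir * validation_split)
n_train = n_train_dir - n_valid


def num_batches(n_samples, batch_size):
    """Number of batches when the final, smaller batch is not dropped."""
    return math.ceil(n_samples / batch_size)


print("training reviews:  ", n_train, "in", num_batches(n_train, batch_size), "batches")
print("validation reviews:", n_valid, "in", num_batches(n_valid, batch_size), "batches")
print("test reviews:      ", n_test, "in", num_batches(n_test, batch_size), "batches")
```

If the counts printed by `text_dataset_from_directory()` differ from these, the directory contents are probably not what the description above assumes.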

## Pretrained BERT model

### Initialization

In [None]:
bertmodel = keras_nlp.models.BertClassifier.from_preset("bert_tiny_en_uncased_sst2")

In [None]:
plot_model(bertmodel, show_shapes=True)

### Inference

In [None]:
scores = bertmodel.evaluate(imdb_test, verbose=2)
for i, m in enumerate(bertmodel.metrics_names):
    print("%s: %.4f" % (m, scores[i]))

In [None]:
myreview = 'This movie was the worst I have ever seen and the actors were horrible.'
#myreview = 'This movie is great and I madly love the plot from beginning to end.'

logits = bertmodel.predict([myreview], batch_size=1)
probs = tf.nn.softmax(logits).numpy().squeeze()
print('Predicted sentiment: {}TIVE ({:.4f}/{:.4f})'.format("POSI" if probs[1]>probs[0] else "NEGA", probs[0], probs[1]))
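The prediction cell above turns the two logits returned by the classifier into class probabilities with a softmax. The following NumPy sketch shows the same computation on made-up logits. It is an illustration of the formula, not the TensorFlow implementation, and the names `softmax`, `example_logits` and `example_probs` are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtracting the row maximum leaves the result unchanged
    # but avoids overflow in the exponential.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Made-up logits for one review: index 0 = negative, index 1 = positive.
example_logits = np.array([[2.0, -1.0]])
example_probs = softmax(example_logits).squeeze()

example_label = "POSITIVE" if example_probs[1] > example_probs[0] else "NEGATIVE"
print("Predicted sentiment: {} ({:.4f}/{:.4f})".format(
    example_label, example_probs[0], example_probs[1]))
```

Because the softmax is monotonic, comparing the two probabilities gives the same decision as comparing the raw logits. The probabilities are mainly useful for reporting how confident the model is.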

## Finetuned BERT model

### Initialization

In [None]:
bertmodel2 = keras_nlp.models.BertClassifier.from_preset(
    "bert_tiny_en_uncased",
    num_classes=2,
)

In [None]:
plot_model(bertmodel2, show_shapes=True)

### Learning

In [None]:
%%time

epochs = 1

bertmodel2.fit(imdb_train,
               validation_data=imdb_valid,
               epochs=epochs)

### Inference

In [None]:
scores2 = bertmodel2.evaluate(imdb_test, verbose=2)
for i, m in enumerate(bertmodel2.metrics_names):
    print("%s: %.4f" % (m, scores2[i]))

In [None]:
myreview = 'This movie was the worst I have ever seen and the actors were horrible.'
#myreview = 'This movie is great and I madly love the plot from beginning to end.'

logits = bertmodel2.predict([myreview], batch_size=1)
probs = tf.nn.softmax(logits).numpy().squeeze()
print('Predicted sentiment: {}TIVE ({:.4f}/{:.4f})'.format("POSI" if probs[1]>probs[0] else "NEGA", probs[0], probs[1]))

---
*Run this notebook in Google Colaboratory using [this link](https://colab.research.google.com/github/csc-training/intro-to-dl/blob/master/day1/04-tf2-imdb-rnn.ipynb).*