### Keras: Python library for deep learning
### TensorFlow: Open source library for ML


### 1. Setup

In [29]:
import pandas as pd 
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import string
import re
import seaborn as sns
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

### 2. Load Data
* We use the tf.keras.preprocessing.text_dataset_from_directory utility to load the reviews from disk as a tf.data.Dataset.

In [3]:
batch_size = 32
seed = 42
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory('/home/nirzaree/Data/MLDatasets/StanfordMovieRatingDatabase/aclImdb_v1/aclImdb/train/',
                                                                 batch_size=batch_size,
                                                                 validation_split=0.2,
                                                                 subset='training',
                                                                 seed=seed)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.


In [27]:
raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory('/home/nirzaree/Data/MLDatasets/StanfordMovieRatingDatabase/aclImdb_v1/aclImdb/train/',
                                                                 batch_size=batch_size,
                                                                 validation_split=0.2,
                                                                 subset='validation',
                                                                 seed=seed)

Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
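
Both calls pass the same validation_split and seed, which is what keeps the training and validation subsets from overlapping. The sketch below illustrates that idea in plain Python. It is not how Keras implements the split internally; `split_files` and the file names are made up for illustration.

```python
import random

def split_files(files, validation_split, subset, seed):
    """Hypothetical helper: shuffle deterministically, then slice off one subset."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)  # same seed -> same order on every call
    n_val = int(len(shuffled) * validation_split)
    if subset == "training":
        return shuffled[:len(shuffled) - n_val]
    return shuffled[len(shuffled) - n_val:]

files = [f"review_{i}.txt" for i in range(25000)]  # placeholder file names
train_files = split_files(files, 0.2, "training", seed=42)
val_files = split_files(files, 0.2, "validation", seed=42)

print(len(train_files), len(val_files))        # 20000 5000
print(set(train_files).isdisjoint(val_files))  # True
```

With different seeds in the two calls, the shuffled orders would differ and some reviews could land in both subsets.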


#### Check some data

In [6]:
for text_batch,label_batch in raw_train_ds.take(1):
    for i in range(3):
        print('Review:',text_batch.numpy()[i])
        print('Label:',label_batch.numpy()[i])

Review: b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'
Label: 0
Review: b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as they get in

* Label mapping check

In [7]:
print('Label: 0 corresponds to',raw_train_ds.class_names[0])
print('Label: 1 corresponds to',raw_train_ds.class_names[1])

Label: 0 corresponds to neg
Label: 1 corresponds to pos


In [11]:
raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory('/home/nirzaree/Data/MLDatasets/StanfordMovieRatingDatabase/aclImdb_v1/aclImdb/test',batch_size=batch_size)

Found 25000 files belonging to 2 classes.


### 3. Preprocess the data: Standardize, Tokenize, Vectorize
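* Standardization: lowercase the text and strip HTML line breaks and punctuation
* Tokenization: split each review into words
* Vectorization: map each word to an integer index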

In [9]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [18]:
def standardize_input_text(input_data):
    # Lowercase, replace HTML line breaks with a space, then strip punctuation.
    lowercase = tf.strings.lower(input_data)
    html_removed = tf.strings.regex_replace(lowercase,'<br />',' ')
    return tf.strings.regex_replace(html_removed,'[%s]' % re.escape(string.punctuation),'')

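* The pattern '[%s]' % re.escape(string.punctuation) is a regex character class: re.escape backslash-escapes the punctuation characters so they are matched literally, and the surrounding [ ] makes the pattern match any single one of them.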
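
A quick way to check the punctuation pattern is to build it and apply it to a sample string with Python's own re module. This is only an illustration: the notebook passes the same pattern to tf.strings.regex_replace, and the sample text below is a shortened excerpt chosen for the example.

```python
import re
import string

pattern = '[%s]' % re.escape(string.punctuation)

# re.escape backslash-escapes characters that are special in a regex,
# and the surrounding [...] turns the result into a character class
# that matches any single punctuation character.
print(pattern.startswith('['), pattern.endswith(']'))  # True True

sample = "wasn't all that funny... (out of four)"
print(re.sub(pattern, '', sample))  # wasnt all that funny out of four
```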

In [None]:
# Why can't re be used here, like this?
#     def standardize_input_text(input_data):
#         input_data = re.sub('<br />',' ',input_data)
#         input_data = re.sub(string.punctuation,' ',input_data)
#         return input_data
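
On the question in the commented-out cell: a function passed as standardize receives a tf.Tensor of strings rather than a Python str, so re.sub cannot be applied to it directly, which is why the tf.strings ops are used. Separately, string.punctuation is not a valid regex by itself. The sketch below is a plain-Python version that works on ordinary strings only; `standardize_py` is a hypothetical name, and it is not a drop-in replacement for the function given to the layer.

```python
import re
import string

def standardize_py(text):
    """Plain-str counterpart of standardize_input_text (illustration only)."""
    text = text.lower()
    text = re.sub('<br />', ' ', text)
    return re.sub('[%s]' % re.escape(string.punctuation), '', text)

cleaned = standardize_py("How bizarre is that?<br /><br />*1/2 (out of four)")
print(repr(cleaned))

# Passing string.punctuation directly as the pattern fails to compile,
# because characters such as ( ) * + [ are regex syntax.
try:
    re.compile(string.punctuation)
    raw_pattern_ok = True
except re.error:
    raw_pattern_ok = False
print(raw_pattern_ok)  # False
```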

### Vectorization layer

In [19]:
max_features = 10000
sequence_length = 250
vectorization_layer = TextVectorization(standardize = standardize_input_text,
                                        max_tokens = max_features,
                                        output_mode = 'int',
                                        output_sequence_length = sequence_length)

* Call the adapt method to build the vocabulary from the training text.

In [20]:
# Keep only the text, not the labels
train_text = raw_train_ds.map(lambda x,y:x)
vectorization_layer.adapt(train_text)
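
Conceptually, adapt scans the training text and builds a frequency-ranked vocabulary capped at max_tokens; in 'int' output mode, index 0 is reserved for padding and index 1 for out-of-vocabulary words. The plain-Python sketch below mimics that behavior on a toy corpus. The helpers `build_vocab` and `to_ids` are hypothetical names, the corpus is invented, and tie-breaking among equally frequent words may differ from the real layer.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Hypothetical stand-in for adapt(): rank words by frequency."""
    counts = Counter(word for text in texts for word in text.split())
    # Two slots are reserved: '' (padding, index 0) and '[UNK]' (index 1).
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ['', '[UNK]'] + most_common

def to_ids(text, vocab, sequence_length):
    """Map words to indices, then pad with 0 or truncate to a fixed length."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(word, 1) for word in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

corpus = ["the movie was bad", "the movie was good", "the plot was thin"]
vocab = build_vocab(corpus, max_tokens=5)
print(vocab)                                    # ['', '[UNK]', 'the', 'was', 'movie']
print(to_ids("the movie was great", vocab, 6))  # [2, 4, 3, 1, 0, 0]
```

The same convention explains the 1 entries in the vectorized review printed in this notebook: they are words that did not make it into the 10000-token vocabulary.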

#### Check vectorization output

In [21]:
def vectorize_text(text,label):
    text = tf.expand_dims(text,-1) # Append a length-1 axis, e.g. shape (batch,) -> (batch, 1)
    return vectorization_layer(text),label

In [25]:
text_batch,label_batch = next(iter(raw_train_ds))
first_review,first_label = text_batch[0],label_batch[0]
print('First review',first_review)
print('First label',raw_train_ds.class_names[first_label])
print('Vectorized review',vectorize_text(first_review,first_label))

First review tf.Tensor(b'Recipe for one of the worst movies of all time: a she-male villain who looks like it escaped from the WWF, has terrible aim with a gun that has inconsistent effects (the first guy she shoots catches on fire but when she shoots anyone else they just disappear) and takes time out to pet a deer. Then you got the unlikable characters, 30 year old college students, a lame attempt at a surprise ending and lots, lots more. Avoid at all costs.', shape=(), dtype=string)
First label neg
Vectorized review (<tf.Tensor: shape=(1, 250), dtype=int64, numpy=
array([[9257,   15,   28,    5,    2,  241,   91,    5,   30,   58,    4,
           1, 1011,   36,  262,   38,    9, 3891,   35,    2,    1,   43,
         382, 5223,   16,    4, 1113,   12,   43, 5739,  300,    2,   83,
         225,   55, 3209, 3898,   20,  973,   18,   51,   55, 3209,  250,
         320,   34,   40, 4386,    3,  294,   58,   44,    6, 2911,    4,
        6757,   92,   22,  184,    2, 4916,  100, 1221, 

In [26]:
print('9257 -->', vectorization_layer.get_vocabulary()[9257])
print('15 -->', vectorization_layer.get_vocabulary()[15])
print('28 -->', vectorization_layer.get_vocabulary()[28])


9257 --> recipe
15 --> for
28 --> one


#### Apply vectorization to the training, validation, and test sets

In [28]:
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)


APIs:
    1. tf.keras.preprocessing.text_dataset_from_directory: builds a tf.data.Dataset from a directory structure on disk, using the subdirectory names as class labels
    2. ds.take(n): returns a dataset with at most the first n elements; on a batched dataset each element is a batch, so take(1) yields one batch
    3. Iterating over ds.take(n) yields (text_batch, label_batch) pairs; text_batch.numpy()[i] is the ith review in the batch and label_batch.numpy()[i] is its label
    4. TextVectorization from tf.keras.layers.experimental.preprocessing: maps raw text to sequences of integer token indices
    5. tf.expand_dims: returns a tensor with an extra axis of length 1 inserted at the given 'axis' index
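
The behavior of take on a batched dataset can be pictured with a plain-Python analogue. This is only a sketch using itertools.islice, not TensorFlow code; `batched` is a hypothetical helper and the review strings are placeholders.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

reviews = [f"review {i}" for i in range(100)]  # placeholder data
batches = batched(reviews, batch_size=32)

# Analogue of ds.take(1) on a batched dataset: the first element is a whole batch.
first = list(islice(batches, 1))
print(len(first), len(first[0]))  # 1 32
print(first[0][:3])               # ['review 0', 'review 1', 'review 2']
```

This matches the inspection loop in the notebook, where take(1) yields a single batch and the inner loop indexes the first three reviews inside it.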