### Working on IMDB movie reviews data

Processing words as a set: the bag-of-words approach.
I will use the bag-of-words model in this notebook.

A sequence model is also an option, but the bag-of-words model is expected to give higher accuracy on this dataset.
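To make the "set" versus "sequence" distinction concrete, here is a minimal sketch in plain Python. The two example reviews are made up for illustration; the point is only that a bag of words throws away word order, so two reviews with opposite meanings can end up with the same representation.

```python
# Two made-up reviews that use the same words in a different order
review = "the plot was good but the acting was not good"
other = "the plot was not good but the acting was good"

# Sequence view: order and repetitions are kept
as_sequence = review.split()

# Set view (bag-of-words): order and repetitions are discarded
as_set = set(as_sequence)

print(as_sequence)
print(sorted(as_set))

# Both reviews collapse to the same bag of words
same_bag = set(other.split()) == as_set
print(same_bag)
```

Losing word order is the price of this approach; what it buys is a very simple, fixed-size input for the model.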

**Using the text vectorization layer**

In [None]:
import string

class Vectorizer:
    def standardize(self, text):
        text = text.lower()
        return "".join(char for char in text if char not in string.punctuation)

    def tokenize(self, text):
        text = self.standardize(text)
        return text.split()

    def make_vocabulary(self, dataset):
        self.vocabulary = {"": 0, "[UNK]": 1}
        for text in dataset:
            text = self.standardize(text)
            tokens = self.tokenize(text)
            for token in tokens:
                if token not in self.vocabulary:
                    self.vocabulary[token] = len(self.vocabulary)
        self.inverse_vocabulary = dict(
            (v, k) for k, v in self.vocabulary.items())

    def encode(self, text):
        text = self.standardize(text)
        tokens = self.tokenize(text)
        return [self.vocabulary.get(token, 1) for token in tokens]

    def decode(self, int_sequence):
        return " ".join(
            self.inverse_vocabulary.get(i, "[UNK]") for i in int_sequence)


# Using make_vocabulary, encode, and decode method of the Vectorizer class:
vectorizer = Vectorizer()
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
vectorizer.make_vocabulary(dataset)

In [None]:
"""
(Resolved error in the code cell above)

The error message pointed somewhere other than the actual bug.
According to the error message:
`tokens` was not iterable at "for token in tokens:" in the make_vocabulary method.


Actual error:
The bug was in the tokenize method, which returned the method itself instead of calling it.
Incorrect: "return text.split"
Correct: "return text.split()"
"""



In [None]:
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = vectorizer.encode(test_sentence)
print(encoded_sentence)

[2, 3, 5, 7, 1, 5, 6]


In [None]:
decoded_sentence = vectorizer.decode(encoded_sentence)
print(decoded_sentence)

i write rewrite and [UNK] rewrite again


In [None]:
# Pretty amazing result!
# Now let's proceed!

In [None]:
# Now we are using the TextVectorization layer
from tensorflow.keras.layers import TextVectorization
text_vectorization = TextVectorization(
    output_mode="int"
)
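# Why is the layer configured differently in the next cell?
# Ans: the next cell passes custom functions that reproduce the layer's default behavior
# (lowercase, strip punctuation, split on whitespace), to show how it can be customized.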

In [None]:
import re
import string
import tensorflow as tf

def custom_standardization_fn(string_tensor):
  lowercase_string = tf.strings.lower(string_tensor)
  return tf.strings.regex_replace(
      lowercase_string, f"[{re.escape(string.punctuation)}]", "")
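# What does regex_replace do?
# Ans: it replaces every match of the regex pattern (here, any punctuation character)
# with the replacement string (here, the empty string).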

def custom_split_fn(string_tensor):
  return tf.strings.split(string_tensor)


# Now configuring the layer. Note that custom functions can be passed as parameters of the layer.
  # Those parameters are 'standardize' and 'split'
text_vectorization = TextVectorization(
    output_mode="int",
    standardize=custom_standardization_fn,
    split=custom_split_fn,
    )

In [None]:
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
text_vectorization.adapt(dataset)

# I think the layer's "adapt" method works similarly to our "make_vocabulary" method
  # Ans: more precisely, adapt() builds the vocabulary and get_vocabulary() returns it,
  # so the two together play the role of make_vocabulary

**Displaying the vocabulary**

In [None]:
#text_vectorization.get_vocabulary()

vocabulary = text_vectorization.get_vocabulary()
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = text_vectorization(test_sentence)
print(encoded_sentence)

inverse_vocab = dict(enumerate(vocabulary))
decoded_sentence = " ".join(inverse_vocab[int(i)] for i in encoded_sentence)
print(decoded_sentence)

tf.Tensor([ 7  3  5  9  1  5 10], shape=(7,), dtype=int64)
i write rewrite and [UNK] rewrite again


In [None]:
"""
Up to now, I know that TextVectorization is imported from tensorflow.
The parameters of this layer are customized with our own functions.

How are we using it after the configuration?
- Using this method: adapt(dataset)
- Using this method: get_vocabulary()

Then, to encode any new data, we just call the layer directly:
-  text_vectorization(test_sentence)

But we do the decoding ourselves.
"""


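In [None]:
"""
Encoding differences:
our own layer: [2, 3, 5, 7, 1, 5, 6]
tensorflow: [ 7  3  5  9  1  5 10]

The integer indices differ because the two vocabularies are ordered differently,
but both sequences decode to the same text.
"""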

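The two index sequences above can be reproduced in plain Python by building the vocabulary in two different orders. This is only a sketch: the first ordering mirrors `make_vocabulary` (order of first appearance), while the second puts the most frequent tokens first. The reverse-alphabetical tie-break is an assumption I picked because it happens to reproduce the indices printed by the layer; I have not checked that it is documented Keras behavior.

```python
import string
from collections import Counter

def tokenize(text):
    # Same standardization as the Vectorizer class: lowercase, strip punctuation
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()

dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
all_tokens = [tok for text in dataset for tok in tokenize(text)]

# Ordering 1: tokens indexed in order of first appearance (like make_vocabulary)
first_seen = ["", "[UNK]"] + list(dict.fromkeys(all_tokens))

# Ordering 2: most frequent tokens first.
# ASSUMPTION: ties are broken in reverse alphabetical order (chosen to match the output above).
counts = Counter(all_tokens)
by_frequency = ["", "[UNK]"] + sorted(
    sorted(counts, reverse=True), key=lambda tok: -counts[tok])

def encode(text, vocabulary):
    index = {tok: i for i, tok in enumerate(vocabulary)}
    return [index.get(tok, 1) for tok in tokenize(text)]   # 1 is the [UNK] index

sentence = "I write, rewrite, and still rewrite again"
print(encode(sentence, first_seen))     # [2, 3, 5, 7, 1, 5, 6]
print(encode(sentence, by_frequency))   # [7, 3, 5, 9, 1, 5, 10]
```

Either ordering works as long as the same vocabulary is used for encoding and decoding.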

### Two approaches for representing groups of words: Sets and sequences

**Preparing the IMDB movie reviews data**

In [None]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  26.7M      0  0:00:03  0:00:03 --:--:-- 26.7M


In [None]:
!rm -r aclImdb/train/unsup
!cat aclImdb/train/pos/4077_10.txt

I first saw this back in the early 90s on UK TV, i did like it then but i missed the chance to tape it, many years passed but the film always stuck with me and i lost hope of seeing it TV again, the main thing that stuck with me was the end, the hole castle part really touched me, its easy to watch, has a great story, great music, the list goes on and on, its OK me saying how good it is but everyone will take there own best bits away with them once they have seen it, yes the animation is top notch and beautiful to watch, it does show its age in a very few parts but that has now become part of it beauty, i am so glad it has came out on DVD as it is one of my top 10 films of all time. Buy it or rent it just see it, best viewing is at night alone with drink and food in reach so you don't have to stop the film.<br /><br />Enjoy

In [None]:
import os, pathlib, shutil, random

base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"

for category in ("neg", "pos"):
  os.makedirs(val_dir / category)
  files = os.listdir(train_dir / category)
  random.Random(1337).shuffle(files)
  num_val_samples = int(0.2 * len(files))   # Hold out 20% of the training files for validation
  val_files = files[-num_val_samples:]
  for fname in val_files:                  # Moving data to val_dir from train_dir
    shutil.move(train_dir / category / fname,
                val_dir / category / fname)
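To convince myself that the split logic does what I think, here is a small self-contained sketch that runs the same steps on ten dummy files in a temporary directory. The directory layout and file contents are made up; only the shuffle-then-move logic follows the cell above.

```python
import os
import pathlib
import random
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical miniature layout: tmp/train/pos and tmp/val/pos
    train_dir = pathlib.Path(tmp) / "train" / "pos"
    val_dir = pathlib.Path(tmp) / "val" / "pos"
    train_dir.mkdir(parents=True)
    val_dir.mkdir(parents=True)
    for i in range(10):                        # ten dummy "reviews"
        (train_dir / f"{i}.txt").write_text("dummy review")

    files = sorted(os.listdir(train_dir))      # sort so the shuffle is reproducible
    random.Random(1337).shuffle(files)         # fixed seed, as in the cell above
    num_val_samples = int(0.2 * len(files))    # 2 of the 10 files
    for fname in files[-num_val_samples:]:
        shutil.move(str(train_dir / fname), str(val_dir / fname))

    n_train = len(os.listdir(train_dir))
    n_val = len(os.listdir(val_dir))

print(n_train, n_val)   # 8 2
```

One detail worth knowing: `os.listdir` returns files in arbitrary order, so sorting before shuffling (as in this sketch) is what makes the split identical across machines.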

In [None]:
# For debugging:
#shutil.rmtree("aclImdb/val/neg", ignore_errors=True)
#shutil.rmtree("aclImdb/val/pos", ignore_errors=True)

# Nothing more fearful than a powerful error.
  # Fight hard!

In [None]:
# Rough Work:
#num_val_samples #2500
#len(files)      #12500
#files[:100]     # all are txt files

In [None]:
# Why does a test directory already exist?
# Ans (I think):
# the test and train directories were already part of the archive,
# so they appeared when it was extracted.
# We have only created the 'val' directory.
# This can be confirmed by referring to the book.

In [None]:
from tensorflow import keras
batch_size = 32

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [None]:
# I think keras.utils.text_dataset_from_directory only picks up text (.txt) files
# type(train_ds)   # tensorflow.python.data.ops.dataset_ops.BatchDataset      # rough
# Will we use all three directories?
   #  Ans: Yes

In [None]:
# Rough Work
#output of "train_ds"
#<BatchDataset element_spec=(TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>

In [None]:
# Each iteration yields one batch of 32 text files
for inputs, targets in train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32,)
inputs.dtype: <dtype: 'string'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor(b'The movie is wonderful. It shows the man\'s work for the wilderness and a natural understanding of the harmony of nature, without being an "extreme" naturalist. I definitely plan to look for the book. This is a rare treasure!<br /><br />', shape=(), dtype=string)
targets[0]: tf.Tensor(1, shape=(), dtype=int32)


In [None]:
# MY TESTING:
#print(inputs[32]) # error in this test code
# targets   # array of 1's and 0's
#len(inputs[0]) # error
#inputs[0].dtype # tf.string
#a = tf.strings.length(inputs[:]).numpy() # 1270
#print(a, max(a), min(a),sep='\n')

### Processing words as a set: The bag-of-words approach

In [None]:
# Yes! I will not process the reviews as sequences, because I'm only interested in the bag-of-words approach for its better accuracy here.

#### Single words (unigrams) with binary encoding

**Preprocessing our datasets with a TextVectorization layer**

In [None]:
text_vectorization = TextVectorization(
    max_tokens = 20000,
    output_mode = "multi_hot"
)
text_only_train_ds = train_ds.map(lambda x, y:x)
text_vectorization.adapt(text_only_train_ds)

binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

In [None]:
# Inspecting the output of our binary unigram dataset:
for inputs, targets in binary_1gram_train_ds:
    print("inputs.shape:", inputs.shape)
    print("inputs.dtype:", inputs.dtype)
    print("targets.shape:", targets.shape)
    print("targets.dtype:", targets.dtype)
    print("inputs[0]:", inputs[0])
    print("targets[0]:", targets[0])
    break

inputs.shape: (32, 20000)
inputs.dtype: <dtype: 'float32'>
targets.shape: (32,)
targets.dtype: <dtype: 'int32'>
inputs[0]: tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
targets[0]: tf.Tensor(0, shape=(), dtype=int32)


In [None]:
#print(inputs[0][:1000].numpy())   # length = 20000
# The whole text data is now converted to 1's and 0's
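What multi-hot encoding does can be sketched in a few lines of plain Python. The tiny vocabulary below is made up, and using index 0 for unknown tokens is a choice of this sketch, so treat it as an illustration of the idea rather than the layer's exact implementation.

```python
def multi_hot(tokens, vocabulary):
    """Return a 0/1 vector with one slot per vocabulary entry."""
    vector = [0.0] * len(vocabulary)
    for token in tokens:
        # Unknown tokens fall back to index 0 in this sketch
        vector[vocabulary.get(token, 0)] = 1.0
    return vector

# Hypothetical four-entry vocabulary (the real one has 20,000 entries)
vocabulary = {"[UNK]": 0, "movie": 1, "great": 2, "boring": 3}

print(multi_hot("great great movie".split(), vocabulary))   # [0.0, 1.0, 1.0, 0.0]
print(multi_hot("terrible movie".split(), vocabulary))      # [1.0, 1.0, 0.0, 0.0]
```

Note that "great" appearing twice still gives a single 1: the vector records presence, not counts, and every review becomes a vector of the same length whatever its word count.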

**Building Model**

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

def get_model(max_tokens=20000, hidden_dim=16):
  inputs = keras.Input(shape=(max_tokens,))
  x = layers.Dense(hidden_dim, activation="relu")(inputs)
  x = layers.Dropout(0.5)(x)
  outputs = layers.Dense(1, activation="sigmoid")(x)
  model = keras.Model(inputs, outputs)
  model.compile(optimizer="rmsprop",
                loss="binary_crossentropy",
                metrics=["accuracy"])
  return model

In [None]:
model = get_model()
model.summary()

Model: "model_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_6 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_6 (Dense)             (None, 16)                320016    
                                                                 
 dropout_3 (Dropout)         (None, 16)                0         
                                                                 
 dense_7 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
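The parameter counts in the summary can be checked by hand: a Dense layer has one weight per (input, unit) pair plus one bias per unit. The helper name `dense_params` is my own, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus biases (n_units) of a Dense layer."""
    return n_inputs * n_units + n_units

hidden = dense_params(20000, 16)   # first Dense layer: 20000 inputs, 16 units
output = dense_params(16, 1)       # output layer: 16 inputs, 1 unit
total = hidden + output

print(hidden, output, total)       # 320016 17 320033
```

The Dropout layer adds nothing to the total because it has no weights.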


**Training and testing the binary unigram model**

In [None]:
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_1gram.keras",
                                    save_best_only=True)
]
model.fit(binary_1gram_train_ds.cache(),
          validation_data=binary_1gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_1gram.keras")              # Reload the checkpoint: with save_best_only=True it holds the best epoch (by validation loss), not the last one
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.887
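The reason for reloading the saved file before evaluating can be sketched without Keras. With `save_best_only=True` the checkpoint is only overwritten when the monitored value improves, so the file on disk may come from an earlier epoch than the last one. The validation losses below are made-up numbers, not results from this notebook.

```python
# Made-up validation losses, one per epoch (illustrative only)
val_losses = [0.41, 0.33, 0.30, 0.31, 0.35]

best_loss = float("inf")
best_epoch = None
for epoch, loss in enumerate(val_losses, start=1):
    if loss < best_loss:
        # Only an improvement triggers a "save" in this sketch
        best_loss, best_epoch = loss, epoch

last_epoch = len(val_losses)
print(best_epoch, last_epoch)   # 3 5
```

In this toy run the in-memory model would be the epoch-5 one, while the saved file would hold the better epoch-3 weights.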


In [None]:
# MY TESTING:
#print(f"Test acc: {model.evaluate(binary_1gram_test_ds)}")

# MODEL TRAINING TIME WITHOUT GPU = ~ 1min
# TEST ACCURACY: 88.7%

Test acc: [0.2882535457611084, 0.8868399858474731]


In [None]:
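# What is the purpose of ".cache()" in the training cell above?
# Ans: it keeps the preprocessed dataset in memory, so the vectorization runs only during
# the first epoch and is reused afterwards (this only works if the data fits in memory).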

#### Bigrams with binary encoding

In [None]:
#Preprocessing/Configuring the TextVectorization layer to return bigrams
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="multi_hot",
)
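As I understand it, `ngrams=2` makes the layer emit single words and pairs of adjacent words. The helper below is a hand-written sketch of that idea in plain Python (the function name is my own), not the layer's actual code.

```python
def ngrams_up_to(tokens, n):
    """All contiguous n-grams of size 1..n, as space-joined strings."""
    grams = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams

print(ngrams_up_to("the cat sat".split(), 2))
# ['the', 'cat', 'sat', 'the cat', 'cat sat']
```

Bigrams bring back a little local word order: "not good" becomes a feature of its own instead of two unrelated words.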

**Training and testing the binary bigram model**

In [None]:
text_vectorization.adapt(text_only_train_ds)
binary_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

In [None]:
model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("binary_2gram.keras",
                                    save_best_only=True)
]

Model: "model_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_8 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_10 (Dense)            (None, 16)                320016    
                                                                 
 dropout_5 (Dropout)         (None, 16)                0         
                                                                 
 dense_11 (Dense)            (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________


In [None]:
model.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

In [None]:
# TEST ACCURACY: 89.6%
# MODEL TRAINING AND TESTING TIME ~ 1.5min 

In [None]:
# I'M CHANGING EPOCHS
# (Note: this continues training the model loaded above; call get_model() first for a fresh run.)
model.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds.cache(),
          epochs=4,
          callbacks=callbacks)
model = keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
Test acc: 0.899


In [None]:
# TEST ACCURACY: 89.9%
# MODEL TRAINING AND TESTING TIME = 45s

#### Bigrams with TF-IDF encoding

In [None]:
# Configuring the TextVectorization layer to return token counts
from tensorflow.keras.layers import TextVectorization
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="count"
)

In [None]:
# Configuring TextVectorization to return TF-IDF-weighted outputs
text_vectorization = TextVectorization(
    ngrams=2,
    max_tokens=20000,
    output_mode="tf_idf",
)

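# Why overwrite the layer?
# Ans: the "count" configuration in the previous cell is only shown for comparison;
# the TF-IDF configuration here is the one actually used below.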
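To see what TF-IDF weighting means, here is a minimal sketch using a common smoothed textbook formula. The exact formula inside `TextVectorization` may differ, and the three-document corpus is made up, so this only illustrates the idea: words that occur in every document get a low weight, rare words get a higher one.

```python
import math

def tf_idf(term, document, corpus):
    """Textbook TF-IDF: term count times a smoothed inverse document frequency."""
    term_freq = document.count(term)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    # ASSUMPTION: smoothed IDF variant; not necessarily the formula Keras uses
    inverse_doc_freq = math.log((1 + len(corpus)) / (1 + docs_with_term)) + 1
    return term_freq * inverse_doc_freq

# Made-up corpus of three tokenized "reviews"
corpus = [
    "the movie was great".split(),
    "the movie was boring".split(),
    "the plot was great".split(),
]
doc = corpus[1]
scores = {term: tf_idf(term, doc, corpus) for term in ("the", "movie", "boring")}
for term, score in scores.items():
    print(term, round(score, 3))
```

Here "the" appears in all three documents and gets the lowest weight, "movie" appears in two, and "boring" appears in only one and gets the highest.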

In [None]:
text_only_train_ds = train_ds.map(lambda x, y:x)

**Training and testing the TF-IDF bigram model**

In [None]:
text_vectorization.adapt(text_only_train_ds)

tfidf_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

model = get_model()
model.summary()
callbacks = [
    keras.callbacks.ModelCheckpoint("tfidf_2gram.keras",
                                    save_best_only=True)
]

model.fit(tfidf_2gram_train_ds.cache(),
          validation_data=tfidf_2gram_val_ds.cache(),
          epochs=10,
          callbacks=callbacks)
model = keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_2 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_2 (Dense)             (None, 16)                320016    
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense_3 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.878


In [None]:
# TEST ACCURACY: 87.8%
# MODEL TRAINING AND TESTING TIME ~ 2min

In [None]:
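model = keras.models.load_model("binary_2gram.keras")
# Options: "binary_2gram.keras" or "tfidf_2gram.keras"
# Caution: text_vectorization is currently the TF-IDF layer, so it matches "tfidf_2gram.keras";
# pairing it with "binary_2gram.keras" feeds TF-IDF features to a model trained on multi-hot features.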

In [None]:
inputs = keras.Input(shape=(1,),dtype="string")
processed_inputs = text_vectorization(inputs)
outputs = model(processed_inputs)
inference_model = keras.Model(inputs, outputs)

In [None]:
import tensorflow as tf
raw_text_data = tf.convert_to_tensor([["This movie was a great one"],])
predictions = inference_model(raw_text_data)
print(f"{float(predictions[0]*100):.2f} percent positive")

62.60 percent positive


In [None]:
# TESTING:
# Book review:   "That was an excellent movie, I loved it."   95%
# My review 1:   "Crazy"    49%
# My review 2:   "I liked the story and all the characters"    77%
# My review 3:   "Happy ending!"    57%
# "The story was a rubbish"    39%
# "I hate this movie"    45%
# "I did not find this movie interesting"    45%
# "I found this movie interesting"    50%
# "One of the best characters were acting in this movie and the story was very interesting"    62%
# "Awesome", "Great"    58%
# "This movie was a great one"    63%

In [None]:
# I will run the TF-IDF model again, try to get the 95% result again, and test other reviews as well.

In [None]:
# ALHAMDULILLAH!

### **SUMMARY**

**Processing words as a “set”: The bag-of-words approach**

1.   Binary Unigram Model
2.   Binary Bigram Model
3.   TF-IDF Bigram Model