<a href="https://colab.research.google.com/github/TirendazAcademy/An-LLM-App-with-Chainlit/blob/main/Bag-of-Words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds

# Loading the Dataset

In [None]:
raw_train_ds, raw_val_ds, raw_test_ds = tfds.load(
    name="imdb_reviews",
    split=["train[:90%]","train[90%:]","test"],
    as_supervised=True
)

Downloading and preparing dataset 80.23 MiB (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...


Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.


# Understanding the Dataset

In [None]:
for review, label in raw_train_ds.take(3):
  print(review.numpy().decode("utf-8"))
  print("Label: ", label.numpy())

This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.
Label:  0
I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot developmen

# Model Configuration

In [None]:
tf.random.set_seed(42)
train_ds = raw_train_ds.shuffle(5000, seed=42).batch(32).prefetch(1)
val_ds = raw_val_ds.batch(32).prefetch(1)
test_ds = raw_test_ds.batch(32).prefetch(1)
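
As a rough check on the pipeline above, the number of batches per epoch follows from the split sizes and the batch size. The sketch below assumes the standard IMDB splits of 25,000 training and 25,000 test reviews, and that the 90%/10% slices divide the training split exactly. `num_batches` is a hypothetical helper written for this illustration, not a TensorFlow function.

```python
import math

TRAIN_EXAMPLES = 25_000  # assumed size of the IMDB "train" split
TEST_EXAMPLES = 25_000   # assumed size of the IMDB "test" split
BATCH_SIZE = 32


def num_batches(num_examples, batch_size=BATCH_SIZE):
    """Number of batches when the final, smaller batch is kept."""
    return math.ceil(num_examples / batch_size)


train_size = TRAIN_EXAMPLES * 90 // 100  # train[:90%]
val_size = TRAIN_EXAMPLES - train_size   # train[90%:]

print("train:", train_size, "examples,", num_batches(train_size), "batches")
print("val:  ", val_size, "examples,", num_batches(val_size), "batches")
print("test: ", TEST_EXAMPLES, "examples,", num_batches(TEST_EXAMPLES), "batches")
```

Only the training set is shuffled; the validation and test sets are read in their stored order, which does not affect the evaluation metrics.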

# Data Preprocessing

In [None]:
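# Keep the 20,000 most frequent tokens and encode each review
# as a multi-hot vector (1 if the token occurs, 0 otherwise).
text_vectorization = tf.keras.layers.TextVectorization(
    max_tokens=20000,
    output_mode="multi_hot"
)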

In [None]:
# Drop the labels: adapt() only needs the raw review text.
text_only_train_ds = train_ds.map(lambda x, y: x)

In [None]:
text_vectorization.adapt(text_only_train_ds)
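
To make the idea behind `adapt` and the `multi_hot` output mode concrete, here is a minimal pure-Python sketch. It is not how Keras implements the layer. It assumes lowercasing, punctuation stripping, and whitespace splitting as the standardization, and it reserves index 0 for unknown tokens. `standardize`, `build_vocab`, and `multi_hot` are hypothetical helper names.

```python
import re
from collections import Counter


def standardize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def build_vocab(docs, max_tokens):
    """Map tokens to indices; index 0 is reserved for unknown tokens."""
    counts = Counter(tok for doc in docs for tok in standardize(doc))
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 1)]
    return {tok: i for i, tok in enumerate(["[UNK]"] + most_common)}


def multi_hot(text, vocab):
    """Set position i to 1.0 if vocabulary token i occurs in the text."""
    vec = [0.0] * len(vocab)
    for tok in standardize(text):
        vec[vocab.get(tok, 0)] = 1.0  # unseen tokens fall into slot 0
    return vec


docs = ["The movie was great", "The movie was terrible, truly terrible"]
vocab = build_vocab(docs, max_tokens=5)
vec = multi_hot("A great movie", vocab)
print(vocab)
print(vec)
```

Word order and repetition are discarded: the vector only records which vocabulary tokens are present, which is what makes this a bag-of-words representation.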

# The Unigrams Approach

In [None]:
binary_1gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

In [None]:
binary_1gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

In [None]:
for review, label in binary_1gram_train_ds.take(1):
  print(review[0])
  print("Label: ", label[0])

tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
Label:  tf.Tensor(1, shape=(), dtype=int64)


In [None]:
def get_model(max_tokens=20000, hidden_dim=16):
  # Input: one vector of length max_tokens per review
  inputs = tf.keras.Input(shape=(max_tokens,))
  x = tf.keras.layers.Dense(hidden_dim, activation="relu")(inputs)
  # Dropout to reduce overfitting
  x = tf.keras.layers.Dropout(0.5)(x)
  # Single sigmoid unit: probability that the review is positive
  outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
  model = tf.keras.Model(inputs, outputs)
  model.compile(
      optimizer="rmsprop",
      loss="binary_crossentropy",
      metrics=["accuracy"]
  )
  return model

In [None]:
model = get_model()
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense (Dense)               (None, 16)                320016    
                                                                 
 dropout (Dropout)           (None, 16)                0         
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 320,033
Trainable params: 320,033
Non-trainable params: 0
_________________________________________________________________
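
The parameter counts in the summary above can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below redoes that arithmetic; `dense_params` is a hypothetical helper, not a Keras function.

```python
def dense_params(input_dim, units):
    """Weights (input_dim * units) plus one bias per unit."""
    return input_dim * units + units


hidden = dense_params(20_000, 16)  # the "dense" layer
output = dense_params(16, 1)       # the "dense_1" layer
total = hidden + output            # Dropout adds no parameters

print(hidden, output, total)
```

Almost all of the parameters sit in the first layer, because its input is as wide as the vocabulary.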


In [None]:
callbacks=[tf.keras.callbacks.ModelCheckpoint(
    "binary_1gram.keras",
    save_best_only=True
)]
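
With `save_best_only=True`, the checkpoint file is overwritten only when the monitored quantity improves (by default the validation loss), so the file left on disk after training holds the best epoch rather than the last one. The sketch below mimics that bookkeeping in plain Python. The loss values are made up for illustration and `best_epoch` is a hypothetical helper.

```python
def best_epoch(val_losses):
    """Return (epoch, loss) for the lowest validation loss seen."""
    best_loss = float("inf")
    best = None
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # This is the point where a checkpoint would be overwritten.
            best_loss, best = loss, epoch
    return best, best_loss


# Hypothetical validation losses, not taken from this notebook's run.
hypothetical_val_losses = [0.41, 0.33, 0.30, 0.31, 0.34]
print(best_epoch(hypothetical_val_losses))
```

This is why the notebook reloads the model from the `.keras` file before evaluating on the test set.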

In [None]:
model.fit(binary_1gram_train_ds.cache(),
          validation_data=binary_1gram_val_ds,
          epochs=10,
          callbacks=callbacks)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ff6295cc460>

In [None]:
model = tf.keras.models.load_model("binary_1gram.keras")
print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")

Test acc: 0.889


# The Bigrams Approach

In [None]:
text_vectorization = tf.keras.layers.TextVectorization(
    max_tokens=20000,
    output_mode="multi_hot",
    ngrams=2,
)
text_vectorization.adapt(text_only_train_ds)
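
Passing an integer to `ngrams` makes the layer generate n-grams up to that size, so `ngrams=2` yields both single words and adjacent word pairs. The following pure-Python sketch shows the idea on one sentence; `ngrams_up_to` is a hypothetical helper and the ordering of its output is a choice made here, not necessarily the layer's.

```python
def ngrams_up_to(tokens, n):
    """All contiguous n-grams of size 1..n, joined with spaces."""
    out = []
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out


tokens = "the movie was not good".split()
grams = ngrams_up_to(tokens, 2)
print(grams)
```

Pairs such as "not good" let the model see a little local word order that a pure unigram representation loses.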

In [None]:
binary_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

In [None]:
model = get_model()
callbacks=[tf.keras.callbacks.ModelCheckpoint(
    "binary_2gram.keras",
    save_best_only=True
)]
model.fit(binary_2gram_train_ds.cache(),
          validation_data=binary_2gram_val_ds,
          epochs=10,
          callbacks=callbacks)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ff62ba06b00>

In [None]:
model = tf.keras.models.load_model("binary_2gram.keras")
print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

Test acc: 0.902


# TF-IDF Approach

In [None]:
text_vectorization = tf.keras.layers.TextVectorization(
    max_tokens=20000,
    output_mode="tf_idf",
    ngrams=2,
)
text_vectorization.adapt(text_only_train_ds)
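
TF-IDF replaces the 0/1 entries with weights: a token's count in the review, scaled down when the token appears in many reviews. The sketch below uses one common smoothed variant, `count * log(1 + N / (1 + df))`. Treat the exact formula as an assumption, since implementations differ in their smoothing. `document_frequencies` and `tf_idf` are hypothetical helper names.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Number of documents in which each token appears at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def tf_idf(tokens, df, num_docs):
    """Term count times a smoothed inverse document frequency."""
    tf = Counter(tokens)
    return {
        tok: count * math.log(1 + num_docs / (1 + df[tok]))
        for tok, count in tf.items()
    }


docs = [d.split() for d in ["great movie", "terrible movie", "great great acting"]]
df = document_frequencies(docs)
weights = tf_idf(docs[2], df, len(docs))
print(weights)
```

In this toy corpus "acting" gets a higher weight per occurrence than "great", because it appears in fewer documents.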

In [None]:
tfidf_2gram_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
tfidf_2gram_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

In [None]:
model = get_model()
callbacks=[tf.keras.callbacks.ModelCheckpoint(
    "tfidf_2gram.keras",
    save_best_only=True
)]
model.fit(tfidf_2gram_train_ds.cache(),
          validation_data=tfidf_2gram_val_ds,
          epochs=10,
          callbacks=callbacks)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ff62a33e6e0>

In [None]:
model = tf.keras.models.load_model("tfidf_2gram.keras")
print(f"Test acc: {model.evaluate(tfidf_2gram_test_ds)[1]:.3f}")

Test acc: 0.883


# Model Export

In [None]:
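# Wrap the vectorizer and the trained classifier in one model
# that accepts raw strings (one string per sample).
inputs = tf.keras.Input(shape=(1,), dtype="string")
x = text_vectorization(inputs)
outputs = model(x)
inference_model = tf.keras.Model(inputs, outputs)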

In [None]:
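# Shape (1, 1): a batch of one sample holding one string,
# matching the Input(shape=(1,)) declared above.
text_data = tf.convert_to_tensor(
    [["This movie is great. I liked it."]])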

In [None]:
predictions = inference_model(text_data)

In [None]:
print(f"{float(predictions[0, 0] * 100):.2f} percent positive")

86.51 percent positive


Let's connect [YouTube](http://youtube.com/tirendazacademy) | [Medium](http://tirendazacademy.medium.com) | [Twitter](http://twitter.com/tirendazacademy) | [Instagram](https://www.instagram.com/tirendazacademy) | [Linkedin](https://www.linkedin.com/in/tirendaz-academy) 😎