# Bag of words
This example takes us through the first type of tokenisation, bag of words. Here we don't take the overall order of words into account; instead we treat the text as an unordered set of snippets, each up to N consecutive words long. With 3-grams, we build a set from a sentence containing its single words, its pairs of adjacent words, and its triplets of adjacent words.
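
To make the idea concrete, here is a minimal pure-Python sketch of building such a set. The helper name `ngram_bag` and the whitespace tokenisation are my own simplifications for illustration, not what Keras does internally.

```python
def ngram_bag(sentence, max_n=3):
    """Return the set of all 1..max_n-grams in a whitespace-tokenised sentence."""
    tokens = sentence.lower().split()
    bag = set()
    for n in range(1, max_n + 1):
        # Slide a window of n consecutive tokens across the sentence
        for i in range(len(tokens) - n + 1):
            bag.add(" ".join(tokens[i:i + n]))
    return bag


bag = ngram_bag("the cat sat on the mat")
print(sorted(bag))
```

Because the result is a set, the repeated word "the" appears only once, and nothing records where in the sentence each snippet came from.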

In [1]:
from tensorflow import keras
batch_size=32

In [2]:
import os, pathlib, shutil, random

base_dir = pathlib.Path("C:/Users/kaspa/Documents/Code/AI_training/training_data/aclImdb_v1/aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"

for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir/category)
    random.Random(1337).shuffle(files)
    n_val_samples = int(0.2*len(files))
    val_files = files[-n_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname, val_dir / category / fname)
    

FileExistsError: [WinError 183] Cannot create a file when that file already exists: 'C:\\Users\\kaspa\\Documents\\Code\\AI_training\\training_data\\aclImdb_v1\\aclImdb\\val\\neg'

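The code above should be the same as in the example. The FileExistsError means the val/neg directory already existed when the cell ran, most likely because an earlier run had already done the split. The file counts reported below (20000 train, 5000 val) show the split is in place, so the error is harmless here.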
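
One way to make the split safe to re-run is to skip any category whose val folder already exists. This is only a sketch: the function name `split_validation` is hypothetical, and it sorts the file names before shuffling, so it will not pick exactly the same files as the cell above.

```python
import pathlib
import random
import shutil


def split_validation(base_dir, fraction=0.2, seed=1337):
    """Move a fraction of train/<category> into val/<category>, once only."""
    base_dir = pathlib.Path(base_dir)
    train_dir = base_dir / "train"
    val_dir = base_dir / "val"
    for category in ("neg", "pos"):
        target = val_dir / category
        if target.exists():
            # Already split: moving files again would shrink the training set further
            continue
        target.mkdir(parents=True)
        # Sort first so the shuffle does not depend on directory listing order
        files = sorted(p.name for p in (train_dir / category).iterdir())
        random.Random(seed).shuffle(files)
        n_val = int(fraction * len(files))
        # Slice from an explicit start index so n_val == 0 moves nothing
        for fname in files[len(files) - n_val:]:
            shutil.move(str(train_dir / category / fname), str(target / fname))
```

Calling `split_validation(base_dir)` twice then leaves the folders as they were after the first call. The guard is deliberately crude: if a run is interrupted halfway, the existing folder still causes that category to be skipped.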

In [3]:
# Reuse the directories and batch_size defined in the cells above instead of repeating them
train_ds = keras.utils.text_dataset_from_directory(str(train_dir), batch_size=batch_size)
val_ds = keras.utils.text_dataset_from_directory(str(val_dir), batch_size=batch_size)
test_ds = keras.utils.text_dataset_from_directory(str(base_dir / "test"), batch_size=batch_size)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [4]:
from tensorflow.keras.layers import TextVectorization
text_vec = TextVectorization(max_tokens = 20000, output_mode = "multi_hot")
text_only_train_ds = train_ds.map(lambda x,y:x)
text_vec.adapt(text_only_train_ds)

binary_1gram_train_ds = train_ds.map(lambda x,y: (text_vec(x),y), num_parallel_calls=4)
binary_1gram_val_ds = val_ds.map(lambda x,y: (text_vec(x),y), num_parallel_calls=4)
binary_1gram_test_ds = test_ds.map(lambda x,y: (text_vec(x),y), num_parallel_calls=4)
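
To see roughly what output_mode="multi_hot" produces, here is a simplified pure-Python sketch. The function `multi_hot` and the tiny vocabulary are made up for illustration. The real layer also standardises the text (lowercasing and stripping punctuation by default) and learns its vocabulary in adapt(). As far as I understand, it keeps index 0 for out-of-vocabulary tokens in this mode, which is what the "[UNK]" slot below imitates.

```python
def multi_hot(text, vocabulary):
    """Encode text as a 0/1 vector: 1 where the vocabulary token occurs at least once."""
    index = {token: i for i, token in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in text.lower().split():
        # Tokens outside the vocabulary all share slot 0, labelled "[UNK]" here
        vector[index.get(token, 0)] = 1
    return vector


vocabulary = ["[UNK]", "the", "film", "was", "great", "terrible"]
print(multi_hot("The film was great great", vocabulary))  # [0, 1, 1, 1, 1, 0]
print(multi_hot("the plot was terrible", vocabulary))     # [1, 1, 0, 1, 0, 1]
```

Repeating "great" makes no difference to the first vector, which is the "binary" part of the encoding: only presence is recorded, not counts.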

In [9]:
from tensorflow.keras import layers

def get_model(max_tokens=20000, hidden_dim=6):
    inputs = keras.Input(shape=(max_tokens,))
    x = layers.Dense(hidden_dim, activation="relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="rmsprop", loss = "binary_crossentropy", metrics=["accuracy"])
    return model
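
As a quick sanity check on the model defined above, the number of trainable parameters can be worked out by hand: each Dense layer has one weight per input-unit pair plus one bias per unit, and Dropout adds none. The helper `dense_params` is just my own shorthand for that arithmetic.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases for a fully connected layer."""
    return n_inputs * n_units + n_units


max_tokens, hidden_dim = 20000, 6
hidden = dense_params(max_tokens, hidden_dim)  # 20000 * 6 + 6
output = dense_params(hidden_dim, 1)           # 6 * 1 + 1
total = hidden + output
print(hidden, output, total)                   # 120006 7 120013
```

Almost all of the parameters sit in the first layer, so the size of the model is driven by max_tokens rather than by its depth.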
    

In [12]:
model = get_model()
model.summary()

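# Keep only the epoch with the best validation loss on disk
callbacks = [keras.callbacks.ModelCheckpoint("binary_1gram.keras", save_best_only=True)]
# cache() keeps the vectorised batches in memory after the first pass
model.fit(binary_1gram_train_ds.cache(), validation_data=binary_1gram_val_ds.cache(), epochs=10, callbacks=callbacks)
# Reload the best checkpoint before evaluating on the test set
model = keras.models.load_model("binary_1gram.keras")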

print(f"Test acc: {model.evaluate(binary_1gram_test_ds)[1]:.3f}")


Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_4 (Dense)             (None, 6)                 120006    
                                                                 
 dropout_2 (Dropout)         (None, 6)                 0         
                                                                 
 dense_5 (Dense)             (None, 1)                 7         
                                                                 
Total params: 120,013
Trainable params: 120,013
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.870


Nice to see how well even a simple binary 1-gram does. Working out whether a review is positive or negative is a fairly complex task. Even people learning a language might struggle with it without a year or so of study.

# Bigrams with binary encoding
Now on to using bag of words with some ability to capture local word order. A bigram uses pairs of adjacent words and so has a little more information to work with. I wonder how this all ties into information theory.
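
A small pure-Python sketch of what that extra information buys. The helper `bag` and the two phrases are made up for illustration. With single words the two phrases are indistinguishable, while with bigrams they are not.

```python
def bag(text, ngrams=1):
    """Set of all 1..ngrams-grams in a whitespace-tokenised text."""
    tokens = text.lower().split()
    grams = set()
    for n in range(1, ngrams + 1):
        grams.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


a = "good not bad"
b = "bad not good"
print(bag(a) == bag(b))               # True: same words, order lost
print(bag(a, 2) == bag(b, 2))         # False: the adjacent pairs differ
print(sorted(bag(a, 2) - bag(b, 2)))  # ['good not', 'not bad']
```

The order information is still only local: a bigram bag knows which words sat next to each other, but not where in the review they appeared.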

In [14]:
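# Rebind text_vec (used by adapt and map below); as text_vectorisation the bigram layer was never used, so the recorded run is likely still 1-gram.
text_vec = TextVectorization(ngrams=2, max_tokens=20000, output_mode="multi_hot")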

In [15]:
text_vec.adapt(text_only_train_ds)

binary_2gram_train_ds = train_ds.map(lambda x,y: (text_vec(x),y), num_parallel_calls=4)
binary_2gram_val_ds = val_ds.map(lambda x,y: (text_vec(x),y), num_parallel_calls=4)
binary_2gram_test_ds = test_ds.map(lambda x,y: (text_vec(x),y), num_parallel_calls=4)

In [16]:
model = get_model()
model.summary()

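# Keep only the epoch with the best validation loss on disk
callbacks = [keras.callbacks.ModelCheckpoint("binary_2gram.keras", save_best_only=True)]
# cache() keeps the vectorised batches in memory after the first pass
model.fit(binary_2gram_train_ds.cache(), validation_data=binary_2gram_val_ds.cache(), epochs=10, callbacks=callbacks)
# Reload the best checkpoint before evaluating on the test set
model = keras.models.load_model("binary_2gram.keras")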

print(f"Test acc: {model.evaluate(binary_2gram_test_ds)[1]:.3f}")

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, 20000)]           0         
                                                                 
 dense_6 (Dense)             (None, 6)                 120006    
                                                                 
 dropout_3 (Dropout)         (None, 6)                 0         
                                                                 
 dense_7 (Dense)             (None, 1)                 7         
                                                                 
Total params: 120,013
Trainable params: 120,013
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test acc: 0.882


Amazed at how quick it was to change the setup to a 2-gram. That TextVectorization layer is mean!