# Hashing Autoencoders and RAMBO

We are attempting to improve the performance of RAMBO by training an autoencoder to act as a hash function. Our eventual goal is to make a library of PDFs easily searchable, but for now we test with the AOL query dataset used in our class homework.

## Create training and validation data

Here we download the AOL dataset and cache it locally. We will later hash it and split it into training, validation, and test sets.

In [1]:
import urllib.request
from pathlib import Path

AOL_URL = "http://www.cim.mcgill.ca/~dudek/206/Logs/AOL-user-ct-collection/user-ct-test-collection-01.txt"

data_dir = Path("data")
data_file = data_dir / "aol.txt"

# Download the query log once; later runs reuse the cached copy.
if not data_file.is_file():
    # exist_ok=True makes a separate is_dir() check unnecessary.
    data_dir.mkdir(parents=True, exist_ok=True)

    with urllib.request.urlopen(AOL_URL) as data_url, data_file.open(
        "w", encoding="utf-8"
    ) as fd:
        fd.write(data_url.read().decode("utf-8"))

Let's read it into a Pandas `DataFrame` and extract the queries from it.

In [2]:
import numpy as np
import pandas as pd

data = pd.read_csv(data_file, sep="\t")
phrases = data.Query.dropna().unique().tolist()

We convert each phrase to a list of character codes (via `ord`, which matches ASCII for ASCII text) and zero-pad it to 512 elements in length.

In [3]:
PAD_CONST = 512


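def word_to_ascii(word):
    # Map each character to its code point, truncating to PAD_CONST so that
    # an over-long phrase cannot produce a row of a different length.
    ascii_word = list(map(ord, word[:PAD_CONST]))
    # Right-pad with zeros up to PAD_CONST elements.
    padded_ascii = ascii_word + ([0] * (PAD_CONST - len(ascii_word)))
    return padded_ascii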


phrases_ascii = np.array(list(map(word_to_ascii, phrases)))
phrases_ascii.shape

(1216652, 512)
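
To make the encoding concrete, here is a minimal sketch on a toy width of 8 instead of 512. The helper name `to_code_points` and its `width` parameter are illustrative assumptions, not part of the notebook. The sketch also truncates over-long input, which is one possible way to keep every row the same length.

```python
def to_code_points(phrase, width=8):
    """Return `width` integers: character codes, then zero padding."""
    codes = [ord(ch) for ch in phrase[:width]]  # truncate if too long
    return codes + [0] * (width - len(codes))   # right-pad with zeros


short = to_code_points("cat")
long_ = to_code_points("a" * 20)

print(short)
print(len(long_))
```

Every row ends up with the same length, which is what lets NumPy stack the rows into a single two-dimensional array.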

Now we hash each phrase with 32-bit MurmurHash3 and expand each hash into a 32-element bit vector, which serves as our `y` variable.

In [4]:
from sklearn.utils import murmurhash3_32


# Taken from: https://stackoverflow.com/a/47521145
def vec_bin_array(arr, m):
    """
    Arguments:
    arr: Numpy array of non-negative integers
    m: Number of bits of each integer to retain

    Returns a copy of arr with every element replaced with a bit vector
    (most significant bit first). Bits are stored in an int64 array.
    """
    to_str_func = np.vectorize(lambda x: np.binary_repr(x).zfill(m))
    strs = to_str_func(arr)
    ret = np.zeros(list(arr.shape) + [m], dtype=np.int64)
    for bit_ix in range(0, m):
        fetch_bit_func = np.vectorize(lambda x: x[bit_ix] == "1")
        ret[..., bit_ix] = fetch_bit_func(strs).astype("int8")

    return ret


phrases_hashed = vec_bin_array(
    np.array(list(map(lambda x: murmurhash3_32(x, seed=2021, positive=True), phrases))),
    32,
)
phrases_hashed.shape

(1216652, 32)
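
The string-based conversion above is easy to follow but goes through Python strings for every value. As a hedged alternative, the same most-significant-bit-first layout can be sketched with integer shifts and masks. The function name `to_bits` is an assumption introduced for illustration, and the inputs here are small hand-picked integers rather than real hashes.

```python
import numpy as np


def to_bits(values, m):
    """Expand non-negative integers into m-bit vectors, MSB first."""
    values = np.asarray(values, dtype=np.int64)
    # Shift amounts m-1, m-2, ..., 0 so that column 0 holds the top bit.
    shifts = np.arange(m - 1, -1, -1, dtype=np.int64)
    # Broadcasting: (..., 1) >> (m,) gives one column per bit position.
    return ((values[..., None] >> shifts) & 1).astype(np.int8)


bits = to_bits([5, 255], 8)
print(bits)
```

Here 5 becomes `00000101` and 255 becomes `11111111`, matching what `np.binary_repr(x).zfill(8)` would give for the same values.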

## Building our model

Before we train our encoder, we split the dataset into training, validation, and test sets. Holding out 20% for testing and then 25% of the remainder for validation gives a 60/20/20 split.

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    phrases_ascii, phrases_hashed, test_size=0.2, random_state=2021
)

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=2021
)
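
The effect of the two chained splits can be checked with a little arithmetic. This sketch uses exact fractions and only restates the two `test_size` values from the cell above; the variable names are illustrative.

```python
from fractions import Fraction

test_frac = Fraction(1, 5)      # first split: test_size=0.2
val_of_rest = Fraction(1, 4)    # second split: test_size=0.25 of what remains

val_frac = (1 - test_frac) * val_of_rest
train_frac = 1 - test_frac - val_frac

print(train_frac, val_frac, test_frac)
```

The second `test_size` is relative to the data left after the first split, so 0.25 of the remaining 80% is 20% of the whole dataset.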

Now we're finally ready to create `tf.data.Dataset` objects out of our data. This is an API provided by TensorFlow which allows for easy manipulation of data for training models.

In [6]:
import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
val_dataset = tf.data.Dataset.from_tensor_slices((X_val, y_val))
test_dataset = tf.data.Dataset.from_tensor_slices((X_test, y_test))

BATCH_SIZE = 32
SHUFFLE_BUFFER_SIZE = 96

train_dataset = train_dataset.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
val_dataset = val_dataset.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
test_dataset = test_dataset.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)

INFO:tensorflow:Enabling eager execution
INFO:tensorflow:Enabling v2 tensorshape
INFO:tensorflow:Enabling resource variables
INFO:tensorflow:Enabling tensor equality
INFO:tensorflow:Enabling control flow v2


Next, we define our model. We focus on the encoding portion of the encoder-decoder pair, as that is what concerns us the most.

In [9]:
from tensorflow import keras
from tensorflow.keras import layers


def get_encoder():
    inputs = keras.Input(shape=(512,))

    x = layers.Dense(256, activation="relu")(inputs)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(64, activation="relu")(x)

    outputs = layers.Dense(32)(x)

    model = keras.Model(inputs, outputs, name="encoder")

    return model


model = get_encoder()
model.compile(
    loss="kl_divergence",
    optimizer="adam",
    metrics=["accuracy"],
)
model.summary()

Model: "encoder"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 512)]             0         
_________________________________________________________________
dense_4 (Dense)              (None, 256)               131328    
_________________________________________________________________
dense_5 (Dense)              (None, 128)               32896     
_________________________________________________________________
dense_6 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_7 (Dense)              (None, 32)                2080      
Total params: 174,560
Trainable params: 174,560
Non-trainable params: 0
_________________________________________________________________
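
The parameter counts in the summary can be verified by hand: a fully connected layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The short sketch below recomputes them from the layer widths; the helper name `dense_params` is an assumption introduced for illustration.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


widths = [512, 256, 128, 64, 32]  # input size followed by each Dense width
per_layer = [dense_params(a, b) for a, b in zip(widths, widths[1:])]
total = sum(per_layer)

print(per_layer)
print(total)
```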


In [10]:
callbacks = [
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", mode="min", patience=10, verbose=1
    ),
    tf.keras.callbacks.ReduceLROnPlateau(
        factor=0.1, patience=5, min_lr=0.00001, verbose=1
    ),
    tf.keras.callbacks.ModelCheckpoint(
        "model-tgs-salt.h5", verbose=1, save_best_only=True, save_weights_only=True
    ),
]

history = model.fit(
    train_dataset, epochs=20, callbacks=callbacks, validation_data=val_dataset
)

Epoch 1/20

Epoch 00001: val_loss improved from inf to 32.15446, saving model to model-tgs-salt.h5
Epoch 2/20

Epoch 00002: val_loss did not improve from 32.15446
Epoch 3/20

Epoch 00003: val_loss did not improve from 32.15446
Epoch 4/20

Epoch 00004: val_loss improved from 32.15446 to 32.15444, saving model to model-tgs-salt.h5
Epoch 5/20

Epoch 00005: val_loss improved from 32.15444 to 32.15443, saving model to model-tgs-salt.h5
Epoch 6/20

Epoch 00006: ReduceLROnPlateau reducing learning rate to 0.00010000000474974513.

Epoch 00006: val_loss did not improve from 32.15443
Epoch 7/20

KeyboardInterrupt: 
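
In the log above, the validation loss changes only in the fifth decimal place. One possible explanation is that KL divergence expects probability distributions, while our targets are 32 independent bits. An alternative worth considering is per-bit binary cross-entropy on the raw outputs (logits), sketched here in plain NumPy so the computation is visible. The function names are assumptions for illustration. In Keras, the corresponding loss would be `keras.losses.BinaryCrossentropy(from_logits=True)`, although we have not tested it on this model.

```python
import numpy as np


def bce_from_logits(logits, bits):
    """Mean binary cross-entropy over all bits, computed stably from logits."""
    z = np.asarray(logits, dtype=np.float64)
    y = np.asarray(bits, dtype=np.float64)
    # Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))].
    per_bit = np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))
    return per_bit.mean()


def bit_accuracy(logits, bits):
    """Fraction of bits whose sign-thresholded prediction matches the target."""
    predicted = (np.asarray(logits) > 0).astype(np.int64)
    return (predicted == np.asarray(bits)).mean()


target = np.array([[1, 0, 1, 0]])
undecided = bce_from_logits(np.zeros((1, 4)), target)
confident = bce_from_logits(np.array([[9.0, -9.0, 9.0, -9.0]]), target)
acc = bit_accuracy(np.array([[9.0, -9.0, 9.0, 9.0]]), target)
```

An output of all zeros scores `log(2)` per bit, which gives a useful baseline: a model that does no better than this has not learned anything about the hash bits.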

### Plotting training results

Here we plot our training results, just to convince ourselves that our model is working.

In [None]:
import matplotlib.pyplot as plt

#  "Accuracy"
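# With metrics=["accuracy"], TensorFlow 2 records the keys "accuracy" and
# "val_accuracy" (older Keras versions used "acc" and "val_acc").
plt.plot(history.history["accuracy"])
plt.plot(history.history["val_accuracy"])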
plt.title("model accuracy")
plt.ylabel("accuracy")
plt.xlabel("epoch")
plt.legend(["train", "validation"], loc="upper left")
plt.show()
# "Loss"
plt.plot(history.history["loss"])
plt.plot(history.history["val_loss"])
plt.title("model loss")
plt.ylabel("loss")
plt.xlabel("epoch")
plt.legend(["train", "validation"], loc="upper left")
plt.show()

### Testing our model

We test our model on the test dataset we set aside earlier.

In [None]:
results = model.evaluate(test_dataset)

print("Test loss, test accuracy: ", results)

### Saving the model

We save the model in TensorFlow's own `SavedModel` format to use later.

In [None]:
model.save("dense_encoder")