## **BOLT Oracle**
Oracle is an all-purpose classifier for tabular datasets. In addition to learning from the columns of a single row, Oracle can make use of "temporal context". For example, if used to build a movie recommender, Oracle may use information about the last 5 movies that a user has watched to recommend the next movie. Similarly, if used to forecast the outcome of marketing campaigns, Oracle may use several months' worth of campaign history for each product to make better forecasts.

### **0. Install Packages**

In [None]:
!pip3 install thirdai
!pip3 install pandas

### **1. Dataset and Task**
The code below downloads and cleans the Movielens 1M dataset, which contains one million ratings given by 6,040 users to 3,706 movies. Each row consists of a user ID, a movie ID, a rating, and a timestamp, and the cleaned rows are ordered chronologically. This makes the dataset well suited to next-item prediction: given the history of items that each user has interacted with, predict the item that the user will interact with next.

In [None]:
from thirdai import bolt
import os
import pandas as pd
import zipfile

MOVIELENS_1M_URL = "https://files.grouplens.org/datasets/movielens/ml-1m.zip"

ZIP = "./movielens.zip"
DIR = "./movielens"
RATINGS_FILE = DIR + "/ml-1m/ratings.dat"
TRAIN_FILE = "./movielens_train.csv"
TEST_FILE = "./movielens_test.csv"
PREDICTION_FILE = "./movielens_predictions.txt"

def download_movielens_1m_dataset():
    if not os.path.exists(ZIP):
        os.system(
            f"curl {MOVIELENS_1M_URL} --output {ZIP}"
        )

    if not os.path.exists(DIR):
        with zipfile.ZipFile(ZIP, 'r') as zip_ref:
            zip_ref.extractall(DIR)

def format_and_split_dataset():
    if os.path.exists(TRAIN_FILE) and os.path.exists(TEST_FILE):
        return

    # A multi-character delimiter such as '::' needs the Python parsing engine.
    df = pd.read_csv(RATINGS_FILE, header=None, delimiter='::', engine='python')
    df.columns = ["userId", "movieId", "rating", "timestamp"]
    print("Cleaned column names")

    # Convert timestamp from seconds since epoch to a datetime
    # (written to the CSV as YYYY-MM-DD HH:MM:SS).
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit='s')
    print("Cleaned timestamp column")

    # For the next item prediction task, we move the last interaction
    # of every user to the test set, and leave everything else in the
    # train set. Sorting once up front keeps both sets in chronological
    # order and avoids a per-group apply, whose handling of the grouping
    # column differs across pandas versions.
    df = df.sort_values("timestamp", kind="stable")
    df_test = df.groupby("userId").tail(1)
    df_train = df.drop(df_test.index)
    print("Finished splitting into train and test sets")

    # Write to files
    df_train.to_csv(TRAIN_FILE, index=False)
    df_test.to_csv(TEST_FILE, index=False)
    print("Finished writing to files")

download_movielens_1m_dataset()
format_and_split_dataset()

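print("======== HEADER + FIRST 5 LINES ========")
# Print the header plus the first five data rows, closing the file afterwards.
with open(TRAIN_FILE) as train_file:
    for lines_printed, line in enumerate(train_file, start=1):
        print(line, end='')
        if lines_printed == 6:
            break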
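
To make the split concrete, here is a minimal sketch of the same leave-last-out idea on a toy table. The values are made up for illustration and are not taken from Movielens; only pandas is assumed.

```python
import pandas as pd

# Toy interaction log (hypothetical values, not from Movielens).
toy = pd.DataFrame(
    {
        "userId": [1, 2, 1, 2, 1],
        "movieId": [10, 20, 30, 40, 50],
        "timestamp": pd.to_datetime(
            ["2000-01-01", "2000-01-02", "2000-01-03", "2000-01-04", "2000-01-05"]
        ),
    }
)

# Sort chronologically, then hold out each user's most recent interaction.
toy = toy.sort_values("timestamp", kind="stable")
toy_test = toy.groupby("userId").tail(1)
toy_train = toy.drop(toy_test.index)

print("train:", toy_train[["userId", "movieId"]].values.tolist())
print("test: ", toy_test[["userId", "movieId"]].values.tolist())
```

Each user contributes exactly one row to the test set (their latest movie), and every earlier interaction stays in the train set.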

### **2. Using Temporal Features**
If the model learns from each user's entire history of interactions, it will learn what each user is like *in general*, but it will fail to capture what the user needs *at the moment*. To capture this context, we specify "temporal tracking features", which allow the model to track the movies each user has watched recently and use them to make better predictions. Let's run the following block of code and see what recall we get after just 3 epochs.

In [None]:
model = bolt.Oracle(
    data_types={
        "userId": bolt.types.categorical(n_unique_classes=6040),
        "movieId": bolt.types.categorical(n_unique_classes=3706),
        "timestamp": bolt.types.date(),
    },
    temporal_tracking_relationships={
        "userId": ["movieId"]
    },
    target="movieId"
)

model.train(TRAIN_FILE, epochs=3, learning_rate=0.0001, metrics=["recall@10"])
model.evaluate(TEST_FILE, metrics=["recall@1", "recall@10", "recall@100"], output_file=PREDICTION_FILE)

Passing recent interactions to a deep learning model is not a new idea: it is intuitive that a model can predict better when it knows the temporal context of the prediction. However, doing so in production requires significant investment in engineering the right data pipeline. With Oracle, you only need to add one argument in your Python script (`temporal_tracking_relationships` in the block above) to leverage an efficient data pipeline that pairs with our sparse deep learning engine. Furthermore, it can operate in a streaming fashion to fit your big data needs.
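
As a purely conceptual illustration of what tracking recent interactions means, the sketch below attaches each user's most recent movies to every row in a single streaming pass. This is not Oracle's implementation, which the text above does not describe; the function name `add_recent_history` and the `history_size` parameter are hypothetical.

```python
from collections import defaultdict, deque


def add_recent_history(rows, history_size=5):
    """Yield (user, movie, recent), where recent lists the user's previous movies.

    Hypothetical helper for illustration only: rows must already be in
    chronological order, and at most history_size past movies are kept per user.
    """
    recent_by_user = defaultdict(lambda: deque(maxlen=history_size))
    for user, movie in rows:
        history = recent_by_user[user]
        # Emit the context seen *before* this interaction, then record it.
        yield user, movie, list(history)
        history.append(movie)


# Toy (userId, movieId) pairs in chronological order.
interactions = [(1, 10), (2, 20), (1, 30), (1, 50), (2, 40)]
for user, movie, recent in add_recent_history(interactions, history_size=2):
    print(user, movie, recent)
```

Because the helper keeps only a bounded queue per user and reads rows one at a time, it never needs the whole dataset in memory, which is the property a streaming pipeline relies on.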

### **3. More Results**
We summarize our results on three datasets and compare them with the results we got from TensorFlow's two-tower recommendation model. Note that the Movielens results differ from those above because we ran Oracle for more epochs in our benchmarks.

|Dataset|Metric|TensorFlow Recommender|Oracle|
| ----------- | ----------- | ----------- | ----------- |
|Amazon Games|recall@1|0.00373|0.052|
||recall@10|0.0501|0.133|
||recall@100|0.138|0.329|
|Movielens 1M|recall@1|0.0|0.054|
||recall@10|0.00563|0.231|
||recall@100|0.159|0.584|
|Netflix 100M|recall@1|0.000444|0.01|
||recall@10|0.00616|0.064|
||recall@100|0.0682|0.267|
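
For readers unfamiliar with the metric, the sketch below shows one common way to compute recall@k when every test example has exactly one correct item, as in the leave-last-out test set. It is an assumption that the numbers above follow this definition; the function `recall_at_k` and the toy rankings are hypothetical.

```python
def recall_at_k(ranked_predictions, true_items, k):
    """Fraction of examples whose true item appears in the top-k predictions.

    Assumes one true item per example and predictions sorted best-first.
    """
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_items)
        if truth in ranked[:k]
    )
    return hits / len(true_items)


# Toy rankings for three users (hypothetical item IDs).
ranked = [[7, 3, 9], [4, 8, 2], [5, 1, 6]]
truth = [3, 2, 0]

for k in (1, 2, 3):
    print(f"recall@{k} = {recall_at_k(ranked, truth, k):.3f}")
```

Under this definition recall@k can only stay the same or grow as k increases, which matches the pattern in every row group of the table.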

### **4. Load and Save**
As with our other autoclassifiers, it is easy to save an instance of Oracle and load it later for inference.

In [None]:
model.save("saved_model.seq")
loaded_model = bolt.Oracle.load("saved_model.seq")
loaded_model.evaluate(TEST_FILE, metrics=["recall@1", "recall@10", "recall@100"], output_file=PREDICTION_FILE)