# Using BOLT
## Basics
Let's learn to use the BOLT Python API with an exercise. We'll do a simple image classification task on the MNIST dataset: given 28-by-28-pixel images of handwritten digits from 0 through 9, predict which digit is drawn.

In [1]:
# TODO(Geordie): Add download scripts and change to relative path
from thirdai import dataset

mnist_train = dataset.load_bolt_svm_dataset(
    filename="datasets/mnist/mnist", 
    batch_size=256)

mnist_test = dataset.load_bolt_svm_dataset(
    filename="datasets/mnist/mnist.t", 
    batch_size=256)


Read 60000 vectors from datasets/mnist/mnist in 0 seconds
Read 10000 vectors from datasets/mnist/mnist.t in 0 seconds


To perform this task, we want to build a simple neural network with these specifications:
* 784 (28 x 28) input dimension
* A single 1000-dim hidden layer with ReLU
* 10-dim output layer with Softmax
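
To make the specification above concrete, here is a minimal NumPy sketch of the forward pass such a network computes. It is not BOLT or Keras code: the function names, the random weight initialization, and the random input batch are assumptions made purely for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    # Subtract the row-wise max for numerical stability.
    shifted = x - x.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def forward(batch, w_hidden, b_hidden, w_out, b_out):
    """784 -> 1000 (ReLU) -> 10 (Softmax)."""
    hidden = relu(batch @ w_hidden + b_hidden)
    return softmax(hidden @ w_out + b_out)

# Illustrative random parameters (assumption: small Gaussian weights, zero biases).
rng = np.random.default_rng(0)
w_hidden = rng.normal(scale=0.01, size=(784, 1000))
b_hidden = np.zeros(1000)
w_out = rng.normal(scale=0.01, size=(1000, 10))
b_out = np.zeros(10)

batch = rng.random((4, 784))      # 4 fake "images", flattened to 784 values
probs = forward(batch, w_hidden, b_hidden, w_out, b_out)
print(probs.shape)                # one row of 10 class probabilities per image
```

Each output row sums to 1, so it can be read as a probability distribution over the ten digits.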

In Keras, you would define this network as follows:

In [None]:
from tensorflow import keras

keras_layers = [
    keras.layers.Dense(
        units=1000, 
        activation='relu', 
        input_shape=(784,)),
        
    keras.layers.Dense(
        units=10, 
        activation='softmax')
]

keras_model = keras.Sequential(layers=keras_layers)

And this is how you define the same network in BOLT:

In [2]:
from thirdai import bolt

mnist_layers = [
    bolt.LayerConfig(
        dim=1000, 
        activation_function=bolt.ActivationFunctions.ReLU),
    
    bolt.LayerConfig(
        dim=10, 
        activation_function=bolt.ActivationFunctions.Softmax)
]

mnist_network = bolt.Network(
    layers=mnist_layers, 
    input_dim=784)

Layer: dim=1000, load_factor=1, act_func=ReLU
Layer: dim=10, load_factor=1, act_func=Softmax
Initialized Network in 0 seconds


We now train the network to minimize categorical cross entropy loss and measure our success with the categorical accuracy metric.
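
For readers unfamiliar with these two quantities, the sketch below computes them with NumPy using their standard definitions. How BOLT computes them internally is not shown in this notebook, and the function names and the tiny three-class example are assumptions for illustration.

```python
import numpy as np

def categorical_cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    true_class_probs = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(true_class_probs + eps)))

def categorical_accuracy(probs, labels):
    """Fraction of samples whose highest-probability class is the true class."""
    return float(np.mean(np.argmax(probs, axis=1) == labels))

# Made-up predictions for three samples over three classes.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
])
labels = np.array([0, 1, 2])

loss = categorical_cross_entropy(probs, labels)
accuracy = categorical_accuracy(probs, labels)
print(f"loss={loss:.4f} accuracy={accuracy:.4f}")
```

Training drives the loss down, while the accuracy is only used to report how often the top prediction is correct.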

In [3]:
mnist_network.train(
    train_data=mnist_train, 
    loss_fn=bolt.CategoricalCrossEntropyLoss(), 
    learning_rate=0.001, 
    epochs=1)

mnist_network.predict(
    test_data=mnist_test, 
    metrics=["categorical_accuracy"], 
    verbose=True)


Epoch 1:
Processed 235 training batches in 3 seconds
Processed 40 test batches in 497 milliseconds
Accuracy: 0.9535 (9535/10000)


({'test_time': [497.0], 'categorical_accuracy': [0.9535]},
 array([[1.6119853e-09, 4.9061071e-09, 9.0842747e-09, ..., 9.9999881e-01,
         7.4112470e-12, 7.1178043e-08],
        [7.8348904e-08, 4.7955339e-04, 7.8698450e-01, ..., 1.3695576e-07,
         5.3275298e-06, 2.8925069e-07],
        [9.4192680e-09, 9.9960941e-01, 9.1716443e-05, ..., 2.2237067e-04,
         1.5032618e-05, 3.1097348e-05],
        ...,
        [2.6579289e-10, 1.1534446e-09, 7.5483649e-07, ..., 8.0246173e-05,
         9.5714662e-05, 2.2119880e-03],
        [6.5210443e-05, 3.2664754e-08, 7.9705176e-09, ..., 8.9524441e-08,
         5.7288981e-04, 1.3805798e-08],
        [4.6801352e-07, 1.0962963e-11, 1.6620123e-06, ..., 2.0000725e-11,
         1.0412909e-09, 7.2178027e-11]], dtype=float32))
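
The output above suggests that `predict` returns a tuple of a metrics dictionary and an array of activations. Assuming each row of that array holds the class probabilities for one test sample (the notebook does not state this explicitly), the predicted labels can be recovered with an argmax. The sketch below uses a small made-up stand-in for the returned tuple so that it runs without BOLT.

```python
import numpy as np

# Made-up stand-in for the tuple returned by predict(): a metrics dict and
# one row of class probabilities per test sample (3 classes here, not 10).
result = (
    {"categorical_accuracy": [0.5]},
    np.array([
        [0.05, 0.90, 0.05],
        [0.60, 0.30, 0.10],
    ], dtype=np.float32),
)

metrics, activations = result
predicted_labels = np.argmax(activations, axis=1)   # most likely class per row
confidence = activations.max(axis=1)                # probability of that class
print(predicted_labels.tolist())
```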

## What about bigger models?
One example of a more complicated task that requires a larger network is intent classification. To demonstrate this, we use the CLINC150 dataset, a corpus of customer queries mapped to their intents. For example, the query "do I have to pay for carry-ons on delta?" is labeled with the intent "carry-on", and each intent has a unique id.

In [2]:
# TODO(Geordie): Add download scripts and change to relative path
intent_class_train = dataset.load_bolt_svm_dataset(
    filename="datasets/intent_classification/train_shuf.svm", 
    batch_size=256)

intent_class_test = dataset.load_bolt_svm_dataset(
    filename="datasets/intent_classification/test_shuf.svm", 
    batch_size=256)

Read 18100 vectors from datasets/intent_classification/train_shuf.svm in 2 seconds
Read 5500 vectors from datasets/intent_classification/test_shuf.svm in 0 seconds


We converted the samples in this dataset into roughly 5,000-dimensional sparse input vectors (the network below uses `input_dim=5512`), and we'll use a 10,000-dimensional hidden layer. That makes a model with more than 50 million parameters, which is quite big: with other frameworks, training such a model for just one epoch on a CPU usually takes about 100 seconds. This is where BOLT's unique offering comes in. Take a look at the configuration for the first layer: in addition to the dimension and activation function, it now also sets a load factor. The load factor is a knob for setting the network's computational budget, so you can have the power of deep learning for cheap. Here we use 0.05, which means 500 of the 10,000 neurons are active for each input.

And these aren't just any 500 neurons, as you would get with something like dropout. They are the 500 most important neurons for each input sample, so BOLT effectively curates a small network for each individual sample.
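
The difference between a random subset and an input-dependent subset can be sketched in NumPy. This is a conceptual illustration only: it assumes "most important" means "largest pre-activation", and it computes every pre-activation in order to pick the top 500, which a real sparse implementation would want to avoid. How BOLT actually selects neurons is not described in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, load_factor = 10_000, 0.05
k = round(hidden_dim * load_factor)          # 500 active neurons

# Hypothetical pre-activations of the hidden layer for one input sample.
pre_activations = rng.normal(size=hidden_dim)

# Dropout-style choice: a random subset, unrelated to the input.
random_subset = rng.choice(hidden_dim, size=k, replace=False)

# Input-dependent choice: the k neurons with the largest pre-activations.
top_k_subset = np.argpartition(pre_activations, -k)[-k:]

print("mean pre-activation, random subset:", pre_activations[random_subset].mean())
print("mean pre-activation, top-k subset: ", pre_activations[top_k_subset].mean())
```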

In [None]:
bigger_layers = [
    bolt.LayerConfig(
        dim=10000, 
        load_factor=0.05, 
        activation_function=bolt.ActivationFunctions.ReLU),
    
    bolt.LayerConfig(
        dim=151, 
        activation_function=bolt.ActivationFunctions.Softmax)
]

bigger_network = bolt.Network(
    layers=bigger_layers, 
    input_dim=5512)
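
As a quick sanity check on model size, the sketch below counts the parameters of the two networks defined so far. It assumes plain fully connected layers with one bias per neuron; BOLT's internal layout is not confirmed by this notebook, and `dense_parameter_count` is a hypothetical helper.

```python
def dense_parameter_count(input_dim, layer_dims):
    """Weights plus biases of a stack of fully connected layers."""
    total = 0
    previous_dim = input_dim
    for dim in layer_dims:
        total += previous_dim * dim + dim   # weight matrix + bias vector
        previous_dim = dim
    return total

mnist_params = dense_parameter_count(784, [1000, 10])
bigger_params = dense_parameter_count(5512, [10000, 151])
active_hidden_neurons = round(10000 * 0.05)   # load_factor=0.05

print(mnist_params)            # 795010
print(bigger_params)           # 56640151
print(active_hidden_neurons)   # 500
```

With `input_dim=5512` the larger network comes to about 56.6 million parameters; with an input of exactly 5,000 dimensions the same formula gives about 51.5 million.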

### Sparse inference
You can also use sparsity to accelerate inference, and it takes just one method call. Simply call `network.enable_sparse_inference()`, and the next time the model does inference it will use only the computational budget set by the load factor. Unlike other techniques that rely on pruning or quantization, BOLT's sparse inference is tightly coupled to its sparse training: the network is aware of inference sparsity during training and optimizes for sparse inference directly, which leads to better performance.

In [None]:
bigger_network.train(
    train_data=intent_class_train, 
    loss_fn=bolt.CategoricalCrossEntropyLoss(), 
    learning_rate=0.001, 
    epochs=2)

bigger_network.enable_sparse_inference()

bigger_network.train(
    train_data=intent_class_train, 
    loss_fn=bolt.CategoricalCrossEntropyLoss(), 
    learning_rate=0.001, 
    epochs=1)

bigger_network.predict(
    test_data=intent_class_test, 
    metrics=["categorical_accuracy"], 
    verbose=True)

## What does this enable?
Now that we have shown you how well training time scales as models grow larger, let's see what you can do with an even larger model. This time we'll do sentiment classification on the Yelp Reviews dataset: given a sentence, a restaurant review in this case, predict whether its sentiment is positive or negative.

In [3]:
train_data = dataset.load_bolt_svm_dataset(
    filename="../sa_demo/text_data/yelp_review_full_2class_train.svm", 
    batch_size=1024)

test_data = dataset.load_bolt_svm_dataset(
    filename="../sa_demo/text_data/yelp_review_full_2class_test.svm", 
    batch_size=256)


As a benchmark, we compare against RoBERTa, a state-of-the-art NLP model that was trained for a whole month on 100 GB of data on a fleet of 8 GPUs. We fine-tuned it on this dataset and got an accuracy of 83%.

To do the same task with BOLT, we first convert the sentences into 100,000-dimensional sparse vectors, which is huge! But that's one of the benefits of natively supporting sparsity: you can engineer your features as you like and capture such a rich feature set that you only need to train on one small dataset to build an accurate model. We define the model as follows: a hidden layer of 2,000 dimensions with a load factor of 0.2, followed by a 2-dimensional output layer so we can choose between positive and negative sentiment.
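
The notebook does not say how the sentences were turned into sparse vectors. One common way to get a very wide sparse representation is feature hashing, sketched below. Everything here is an assumption: the choice of unigram and bigram features, the CRC32 hash, the helper names, and the `label index:value` line format, which is only suggested by the `.svm` file extension. The index base the loader expects is not specified either.

```python
import re
import zlib
from collections import Counter

INPUT_DIM = 100_000

def sentence_to_sparse_vector(sentence, dim=INPUT_DIM):
    """Hash unigrams and bigrams into `dim` buckets and count occurrences."""
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    features = tokens + [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    counts = Counter(zlib.crc32(f.encode("utf-8")) % dim for f in features)
    return dict(sorted(counts.items()))   # {bucket index: count}

def to_svm_line(label, sparse_vector):
    """Format one sample as `label index:value index:value ...`."""
    pairs = " ".join(f"{i}:{v}" for i, v in sparse_vector.items())
    return f"{label} {pairs}"

vector = sentence_to_sparse_vector("The food was great and the service was great")
line = to_svm_line(1, vector)
print(len(vector), "non-zero entries out of", INPUT_DIM)
```

Only a handful of the 100,000 dimensions are non-zero for any one sentence, which is why a framework that handles sparse inputs natively can afford such a wide input layer.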

In [None]:
yelp_sentiment_analysis_layers = [
    
    bolt.LayerConfig(dim=2000, 
        load_factor=0.2, 
        activation_function=bolt.ActivationFunctions.ReLU),
    
    bolt.LayerConfig(dim=2,
        load_factor=1.0, 
        activation_function=bolt.ActivationFunctions.Softmax)     
]

yelp_sentiment_analysis_network = bolt.Network(
    layers=yelp_sentiment_analysis_layers, 
    input_dim=100000)

### Load & Save
We trained this on a 10-year-old CPU and it took a few minutes, which is fast for a network of this size but a little long for this demo. So we'll take this chance to introduce our save and load feature. We know that in practice you want to train a model once, save it, and use it repeatedly, and we hear you. All you have to do to save a trained BOLT model is call the `save` method with the file path of your choice.

In [None]:
# TODO(Geordie): Add download scripts and change to relative path


yelp_sentiment_analysis_network.train(
    train_data=train_data,
    loss_fn=bolt.CategoricalCrossEntropyLoss(), 
    learning_rate=0.0001, 
    epochs=20, 
    rehash=6400, 
    rebuild=128000)

yelp_sentiment_analysis_network.save(filename="yelp_sentiment_analysis_cp")

To load a trained model, call the `bolt.Network.load()` static method.

In [None]:
yelp_sentiment_analysis_network = bolt.Network.load(filename="yelp_sentiment_analysis_cp")
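
The train-once, reuse-often workflow can be wrapped in a small helper. The sketch below is generic and runs without BOLT: `load_or_train` and the stand-in callbacks are hypothetical names, not part of the BOLT API.

```python
import os
import tempfile

def load_or_train(checkpoint_path, load_fn, train_and_save_fn):
    """Reuse a saved model if one exists, otherwise train and save a new one."""
    if os.path.exists(checkpoint_path):
        return load_fn(checkpoint_path)
    return train_and_save_fn(checkpoint_path)

# Stand-in callbacks so the sketch runs without BOLT installed.
def fake_train_and_save(path):
    with open(path, "w") as f:
        f.write("trained-model")
    return "freshly trained"

def fake_load(path):
    with open(path) as f:
        return f"loaded: {f.read()}"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint")
    first = load_or_train(path, fake_load, fake_train_and_save)    # trains
    second = load_or_train(path, fake_load, fake_train_and_save)   # loads
    print(first, "|", second)
```

In this notebook, `load_fn` could wrap `bolt.Network.load(filename=...)` and `train_and_save_fn` could wrap the `train` and `save` calls shown above. That usage assumes `save` writes a file at exactly the given path, which this notebook does not confirm.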

## Moment of truth
RoBERTa: 83% accuracy.
Let's see how BOLT does!

In [4]:
# TODO(Geordie): Add download scripts and change to relative path
res = yelp_sentiment_analysis_network.predict(
    test_data=test_data, 
    metrics=["categorical_accuracy"], 
    verbose=True)


We also trained an even larger, 2-billion-parameter model on a larger text corpus to build an interactive sentiment analysis demo. We first load the trained model.

In [5]:
# TODO(Geordie): Add download scripts and change to relative path
sentiment_analysis_network = bolt.Network.load(filename="interactive_demo_cp")

Let's launch the demo to get a feel for what this network can do!

In [6]:
import interactive_sentiment_analysis
interactive_sentiment_analysis.demo(sentiment_analysis_network, verbose=False)
# TODO(Geordie): Make the accuracy disappear when doing interactive demo 

Processed 1 test batches in 18 milliseconds
Accuracy: 0 (0/1)
positive!
Processed 1 test batches in 6 milliseconds
Accuracy: 0 (0/1)
negative!
Processed 1 test batches in 9 milliseconds
Accuracy: 0 (0/1)
negative!
Processed 1 test batches in 9 milliseconds
Accuracy: 0 (0/1)
positive!
Exiting demo...


### Let's talk speed
Now that we've seen how fast inference is on BOLT, let's compare it with RoBERTa by running the following cell.

In [None]:
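import time
from transformers import pipeline

# Load the fine-tuned RoBERTa sentiment model (loading is not timed).
sentiment_analysis = pipeline(
    "sentiment-analysis",
    model="siebert/sentiment-roberta-large-english")

# Time a single inference call with a monotonic, high-resolution clock.
t1 = time.perf_counter()
out = sentiment_analysis("I love chocolate.")
t2 = time.perf_counter()

print(out, flush=True)
print(f"time elapsed: {t2 - t1} s", flush=True)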
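
A single timed call can be noisy, since the first call often includes one-time setup costs. The sketch below is a hypothetical helper, not part of BOLT or Transformers, that warms up first and then reports the median latency over several runs. A dummy function stands in for the model so that the sketch runs anywhere.

```python
import statistics
import time

def time_inference(predict_fn, example, warmup=1, repeats=5):
    """Return the median wall-clock latency of predict_fn(example) in seconds."""
    for _ in range(warmup):
        predict_fn(example)              # exclude one-time setup costs
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_fn(example)
        latencies.append(time.perf_counter() - start)
    return statistics.median(latencies)

# Stand-in for a real model call.
def dummy_predict(text):
    return {"label": "POSITIVE", "length": len(text)}

median_seconds = time_inference(dummy_predict, "I love chocolate.")
print(f"median latency: {median_seconds:.6f} s")
```

To compare the two models fairly, the same helper could be passed the RoBERTa pipeline from the cell above and a function wrapping the BOLT network's `predict` call, using the same input sentence for both.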