# **IMPORTANT**
To run this notebook, you need to download the ThirdAI Docker container by signing up [here](https://www.thirdai.com/try-bolt/).

## **Sentiment analysis with BOLT**

We will walk through the process of building a sentiment analysis model with BOLT from data preprocessing all the way to inference. This notebook is structured as follows:
1. Selecting and preprocessing the dataset
2. Defining the BOLT network
3. Training the network
4. Inference

### **1. Selecting and preprocessing the dataset**
At our webinar on April 6th, we showed how BOLT reached state-of-the-art accuracy on the [Yelp Reviews](https://github.com/huggingface/datasets/blob/master/datasets/yelp_polarity/yelp_polarity.py) dataset and demonstrated that a model trained on the [Amazon Polarity](https://huggingface.co/datasets/amazon_polarity) dataset can be used for interactive, real-time sentiment analysis. Now, we want to give you a chance to try out BOLT with a dataset of your choice. 

We provide a utility function that converts text datasets into sparse input vectors and saves them in SVM format. The text dataset must be a CSV file where each row follows this format:

\<pos or neg\>,\<text\> 

For example, we can have a training corpus called example_train.csv that contains the following:
```
pos,Had a great time at the webinar.
neg,I hate slow deep learning models.
```
We recommend using a training corpus with at least 500,000 training samples.
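
Before converting a large corpus, it can help to confirm that every row matches the `<pos or neg>,<text>` format. The following is a minimal sketch, not part of the ThirdAI utilities: `check_corpus` is a hypothetical helper, and it assumes that everything after the first comma on a line counts as the review text.

```python
import io

VALID_LABELS = {"pos", "neg"}

def check_corpus(lines):
    """Return (line_number, reason) for every row that breaks the format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        # Assumption: the label ends at the first comma; the rest is the text.
        label, sep, text = line.rstrip("\n").partition(",")
        if not sep:
            problems.append((number, "missing comma"))
        elif label not in VALID_LABELS:
            problems.append((number, f"unexpected label {label!r}"))
        elif not text.strip():
            problems.append((number, "empty text"))
    return problems

# An in-memory stand-in for a file such as example_train.csv.
sample = io.StringIO(
    "pos,Had a great time at the webinar.\n"
    "neg,I hate slow deep learning models.\n"
    "positive,This label is not allowed.\n"
)
print(check_corpus(sample))  # [(3, "unexpected label 'positive'")]
```

To check a real file, pass an open file handle instead of the `io.StringIO` object, for example `check_corpus(open(path, encoding="utf-8"))`.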

In [None]:
from thirdai import dataset

text_vector_dim = 100000 # We will vectorize our samples into 100,000-dimensional sparse vectors.

path_to_train_svm = dataset.tokenize_to_svm(
    filename="/path/to/train_data.csv", # TODO: Change the path to train data
    output_dim=text_vector_dim, 
    train=True)

path_to_test_svm = dataset.tokenize_to_svm(
    filename="/path/to/test_data.csv", # TODO: Change the path to test data
    output_dim=text_vector_dim,
    train=False)
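
If you want to inspect the generated files, the sketch below parses one line of the conventional SVM (LIBSVM-style) text layout, `label index:value index:value ...`. This is an assumption: the notebook does not show the exact output of `dataset.tokenize_to_svm()`, so compare against a line from your own file. `parse_svm_line` and the sample line are hypothetical.

```python
def parse_svm_line(line):
    """Split one 'label index:value ...' line into (label, {index: value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return int(label), features

# Hypothetical line: label 1 with three non-zero features out of 100,000.
label, features = parse_svm_line("1 17:1.0 4096:2.0 99999:1.0")
print(label, features)  # 1 {17: 1.0, 4096: 2.0, 99999: 1.0}
```

Only the non-zero entries are stored, which is why a 100,000-dimensional vector can still be a short line of text.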


Let's now load the SVM datasets that we just generated.

In [None]:
train_data = dataset.load_bolt_svm_dataset(
    filename=path_to_train_svm, 
    batch_size=256)

test_data = dataset.load_bolt_svm_dataset(
    filename=path_to_test_svm, 
    batch_size=256)

### **2. Defining the BOLT network**
**Layer configuration**

First, we need to define the sequence of layers. This limited demo version only supports fully-connected layers, which we define using `bolt.FullyConnected()`. It takes the following arguments:
* `dim`: Int - The dimension of the layer.
* `load_factor`: Float - The fraction of neurons to use during sparse training and sparse inference. For example, `load_factor=0.05` means the layer uses 5% of its neurons when processing an individual sample.
* `activation_function`: Bolt activation function - We support three activation functions: `ReLU`, `Softmax` and `Linear`.
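
To make `load_factor` concrete, the arithmetic can be sketched as below. `active_neurons` is a hypothetical helper, not a BOLT function, and the rounding is an assumption; it only illustrates how many neurons a given fraction corresponds to.

```python
def active_neurons(dim, load_factor):
    """Approximate number of neurons a layer uses per sample."""
    # Assumption: the count is simply dim * load_factor, rounded.
    return round(dim * load_factor)

print(active_neurons(2000, 0.05))  # 100
print(active_neurons(2000, 0.2))   # 400
print(active_neurons(2, 1.0))      # 2 (a fully dense layer)
```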

**Constructing the network**

We then call the `bolt.Network()` constructor, which takes in the sequence of layer configurations we defined earlier as well as the dimension of the input vectors.

**Network specifications**

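The network defined below has the same specifications as the network that we used for sentiment analysis on the [Yelp Reviews dataset](https://github.com/huggingface/datasets/blob/master/datasets/yelp_polarity/yelp_polarity.py) during our April 6th webinar. With a 100,000-dimensional input, a 2,000-neuron hidden layer, and a 2-neuron output layer, it has roughly 200 million parameters (100,000 × 2,000 weights in the hidden layer plus 2,000 × 2 in the output layer, not counting biases).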
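
For a rough sense of model size, the parameter count can be derived from the layer dimensions. The sketch below assumes each fully-connected layer stores a full weight matrix plus one bias per neuron, which the notebook does not state explicitly; `dense_parameter_count` is a hypothetical helper.

```python
def dense_parameter_count(input_dim, layer_dims):
    """Count weights and biases for a stack of fully-connected layers."""
    total = 0
    previous = input_dim
    for dim in layer_dims:
        total += previous * dim + dim  # weight matrix plus bias vector
        previous = dim
    return total

# 100,000-dimensional input, 2,000 hidden neurons, 2 output neurons.
print(dense_parameter_count(100_000, [2000, 2]))  # 200006002
```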

In [None]:
from thirdai import bolt

layers = [
    
    bolt.FullyConnected(
        dim=2000, 
        load_factor=0.2, 
        activation_function=bolt.ActivationFunctions.ReLU),
        
    bolt.FullyConnected(
        dim=2,
        load_factor=1.0, 
        activation_function=bolt.ActivationFunctions.Softmax)     
]

network = bolt.Network(
    layers=layers, 
    input_dim=text_vector_dim)

### **3. Training the network**
**The train() method**

Train the BOLT network by calling the `train()` method, which accepts the following arguments:
* `train_data`: BOLT dataset - The training dataset in a format returned by `dataset.load_bolt_svm_dataset()`.
* `loss_fn`: BOLT loss function - The loss function to minimize. In this demo version, we only support the `bolt.CategoricalCrossEntropyLoss()` loss function.
* `learning_rate`: Float - The learning rate for gradient descent. The default value is 0.0001.
* `epochs`: Int - The number of training epochs. One epoch is a full pass through the training dataset.
* `verbose` (Optional): Boolean - Set to `True` to print a progress bar, accuracy, and elapsed time for each training epoch. Set to `False` otherwise. `True` by default.

`train()` returns a dictionary that contains the loss value and elapsed time for each training epoch.


**Training with sparse inference in mind**

If you plan to use sparse inference, we recommend calling the `enable_sparse_inference()` method before the last training epoch for accuracy improvements. For example, if the model trains for 10 epochs, this method should be called after the 9th epoch.


**Saving a trained model**

Call the `save()` method, passing in the path of the file to save to. The call is commented out at the end of the cell below; uncomment it to save the trained model.

In [None]:
network.train(
    train_data=train_data,
    loss_fn=bolt.CategoricalCrossEntropyLoss(), 
    learning_rate=0.0001, 
    epochs=20, 
    verbose=True)

network.enable_sparse_inference()

network.train(
    train_data=train_data,
    loss_fn=bolt.CategoricalCrossEntropyLoss(), 
    learning_rate=0.0001, 
    epochs=1,
    verbose=True)

# network.save(filename="/home/thirdai/work/saved_model")

### **4. Inference**
**The predict() method**

Run inference by calling the `predict()` method, which accepts the following arguments:
* `test_data`: BOLT dataset - The test dataset in a format returned by `dataset.load_bolt_svm_dataset()`.
* `metrics`: List of strings - The metrics used to evaluate the predictions. In this demo version, we only support the `"categorical_accuracy"` metric.
* `verbose` (Optional): Boolean - Set to `True` to print a progress bar, accuracy, and inference time. Set to `False` otherwise. `True` by default.

`predict()` returns a tuple of `(predictions, metric_results)`:
* `predictions`: 2-dimensional NumPy array - The i-th row is the output of the network for the i-th example in the dataset.
* `metric_results`: Dictionary - Maps each metric name in `metrics` to a list of values for that metric, one per epoch (a single entry when returned by `predict()`). An `"epoch_times"` metric is included by default.
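
To turn the rows of `predictions` into labels, one option is to take the index of the largest score in each row. The sketch below uses a made-up array in place of real network output, and the mapping from column index to `pos`/`neg` is an assumption that you should verify against your own data.

```python
import numpy as np

# Hypothetical network output for three reviews; each row holds two class scores.
predictions = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.4, 0.6],
])

# Assumed mapping from column index to label; check it on known examples.
index_to_label = {0: "neg", 1: "pos"}

predicted_indices = predictions.argmax(axis=1)
labels = [index_to_label[int(i)] for i in predicted_indices]
print(labels)  # ['neg', 'pos', 'pos']
```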

**Loading a saved model**

To load a saved model, call the `bolt.Network.load()` method. It is commented out by default so you can continue from the previous cell. Once you have saved a model with `save()`, uncomment it to avoid retraining the next time you open this notebook.

In [None]:
# Uncomment the next line to load a saved model.
# network = bolt.Network.load(filename="/home/thirdai/work/saved_model") 

predictions, metric_results = network.predict(
    test_data=test_data, 
    metrics=["categorical_accuracy"], 
    verbose=True)

print(predictions)
print(metric_results)

### **Congratulations! You just mastered BOLT.**
If you run into any issues running this notebook, please let us know by posting on [GitHub Issues](https://github.com/ThirdAILabs/Demos/issues).