# **IMPORTANT**
To run this notebook, you need to download the ThirdAI docker container by signing up [here](https://www.thirdai.com/try-bolt/).

## **Sentiment analysis with BOLT**

We will walk through the process of building a sentiment analysis model with BOLT from data preprocessing all the way to inference. This notebook is structured as follows:
1. Selecting and preprocessing the dataset
2. Defining the BOLT network
3. Training the network
4. Inference

### **1. Choosing and preprocessing the dataset**
At our webinar on April 6th, we showed how BOLT reached state-of-the-art accuracy on the [Yelp Reviews](https://github.com/huggingface/datasets/blob/master/datasets/yelp_polarity/yelp_polarity.py) dataset and demonstrated that a model trained on the [Amazon Polarity](https://huggingface.co/datasets/amazon_polarity) dataset can be used for interactive, real-time sentiment analysis. Now, we want to give you a chance to try out BOLT with a dataset of your choice. 

We provide a utility function that converts text datasets into input vectors and saves them in SVM format. The text dataset must be a CSV file where each row follows this format:

\<pos or neg\>,\<text\> 

For example, we can have a training corpus called example_train.csv that contains the following:
```
pos,Had a great time at the webinar.
neg,I hate slow deep learning models.
```
We recommend using a training corpus with at least 500,000 training samples.
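
Before preprocessing, it can help to confirm that every row matches the `<pos or neg>,<text>` layout described above. The following is a minimal sketch using only the Python standard library; `parse_sentiment_line` and `check_sentiment_csv` are hypothetical helpers, not part of ThirdAI. It assumes the label is everything before the first comma; whether the ThirdAI utility accepts quoted fields or blank lines is not stated here, so this check simply rejects anything that deviates from the plain layout.

```python
def parse_sentiment_line(line):
    """Split one '<pos or neg>,<text>' row into (label, text)."""
    # Split on the first comma only, so commas inside the review text survive.
    label, sep, text = line.rstrip("\n").partition(",")
    if sep != "," or label not in ("pos", "neg"):
        raise ValueError(f"expected '<pos or neg>,<text>', got: {line!r}")
    if not text.strip():
        raise ValueError(f"empty text in row: {line!r}")
    return label, text


def check_sentiment_csv(path):
    """Return (num_pos, num_neg), raising ValueError on the first malformed row."""
    counts = {"pos": 0, "neg": 0}
    with open(path, encoding="utf-8") as f:
        for line in f:
            label, _ = parse_sentiment_line(line)
            counts[label] += 1
    return counts["pos"], counts["neg"]
```

Calling `check_sentiment_csv("/path/to/train_data.csv")` also gives a quick view of how balanced the two classes are.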

In [None]:
from thirdai import dataset

text_vector_dim = 100_000 # We will vectorize our samples into 100,000-dimensional sparse vectors.

path_to_train_svm = "preprocessed_data_train.svm"
dataset.tokenize_to_svm(
    input_file="/path/to/train_data.csv", # TODO: Change the path to train data
    output_dim=text_vector_dim,
    output_file=path_to_train_svm)

path_to_test_svm = "preprocessed_data_test.svm"
dataset.tokenize_to_svm(
    input_file="/path/to/test_data.csv", # TODO: Change the path to test data
    output_dim=text_vector_dim,
    output_file=path_to_test_svm)
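
To give a feel for what a preprocessed sample might look like, here is a rough sketch of hashing a review into a sparse vector and writing it as one line of SVM-format text. This is an illustration only, not how `dataset.tokenize_to_svm` is implemented: it assumes whitespace tokenization, CRC32 hashing, raw token counts as values, and the common LIBSVM-style layout `<label> <index>:<value> ...`. The helper `text_to_svm_line` and the label ids in `LABEL_IDS` are hypothetical.

```python
import zlib

# Hypothetical label ids; the real mapping used by the ThirdAI utility may differ.
LABEL_IDS = {"neg": 0, "pos": 1}


def text_to_svm_line(label, text, dim):
    """Hash whitespace-separated tokens into `dim` buckets and format the
    resulting sparse count vector as '<label> <index>:<value> ...'."""
    counts = {}
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(token.encode("utf-8")) % dim
        counts[index] = counts.get(index, 0) + 1
    features = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
    return f"{LABEL_IDS[label]} {features}".rstrip()


line = text_to_svm_line("pos", "Had a great great time", dim=100_000)
```

Only the non-zero entries are stored, which is why a 100,000-dimensional vector stays small on disk.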


Let's now load the SVM datasets that we just generated.

In [None]:
train_data, train_labels = dataset.load_bolt_svm_dataset(
    filename=path_to_train_svm, 
    batch_size=256)

test_data, test_labels = dataset.load_bolt_svm_dataset(
    filename=path_to_test_svm, 
    batch_size=256)

### **2. Defining the BOLT network**
**Layer configuration**

First, we need to define the sequence of layers. In this limited demo version, we only support fully-connected layers. The input is declared with `bolt.graph.Input()`, which takes the input dimension `dim`. Each subsequent layer is defined with `bolt.graph.FullyConnected()`, which takes the following arguments:
* `dim`: Int - The dimension of the layer.
* `sparsity`: Float - The fraction of neurons to use during sparse training and sparse inference. For example, `sparsity=0.05` means the layer uses 5% of its neurons when processing an individual sample.
* `activation`: String - The name of the BOLT activation function. We support three activation functions: `relu`, `softmax` and `linear`.
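
As a quick arithmetic illustration of the `sparsity` argument, the sketch below computes how many neurons a layer would evaluate per sample. The helper `active_neurons` is hypothetical, and two things are assumptions: that omitting `sparsity` means the layer is fully dense, and that the count is rounded to the nearest integer. How BOLT actually rounds and selects neurons is not covered in this notebook.

```python
def active_neurons(dim, sparsity=1.0):
    """Number of neurons a layer evaluates per sample at the given sparsity."""
    if not 0.0 < sparsity <= 1.0:
        raise ValueError("sparsity must be in (0, 1]")
    # Always keep at least one neuron active.
    return max(1, round(dim * sparsity))


print(active_neurons(2000, 0.2))   # 400
print(active_neurons(2000, 0.05))  # 100
print(active_neurons(2))           # 2 (dense layer)
```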

**Constructing the network**

We then call the `bolt.graph.Model()` constructor, which takes the list of input layers (`inputs`) and the output layer (`output`).

**Network specifications**

The network defined below has the same specifications as the network that we used for sentiment analysis on the [Yelp Reviews dataset](https://github.com/huggingface/datasets/blob/master/datasets/yelp_polarity/yelp_polarity.py) during our April 6th webinar. With a 100,000-dimensional input, a 2,000-neuron hidden layer, and a 2-neuron output layer, it has roughly 200 million parameters.

In [None]:
from thirdai import bolt

input_layer = bolt.graph.Input(dim=text_vector_dim)

hidden_layer = bolt.graph.FullyConnected(
        dim=2000,
        sparsity=0.2,
        activation="relu",
    )(input_layer)

output_layer = bolt.graph.FullyConnected(dim=2, activation="softmax")(hidden_layer)

model = bolt.graph.Model(inputs=[input_layer], output=output_layer)
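
As a sanity check on the size of the network defined above, the following sketch counts its parameters from the layer dimensions alone. It assumes each fully-connected layer stores a dense weight matrix plus one bias per neuron, which is the usual convention but is an assumption about BOLT's internals; `dense_layer_params` is a hypothetical helper.

```python
def dense_layer_params(input_dim, output_dim):
    """Weights plus one bias per output neuron for a fully-connected layer."""
    return input_dim * output_dim + output_dim


# Input, hidden, and output sizes taken from the cell above.
layer_dims = [100_000, 2000, 2]
total = sum(dense_layer_params(i, o) for i, o in zip(layer_dims, layer_dims[1:]))
print(f"{total:,}")  # 200,006,002
```

Almost all of the parameters sit between the input and the hidden layer, which is where sparse computation matters most.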

### **3. Training the network**

**Model compilation with loss function**

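Before training, compile the model by calling the `compile()` method, which accepts the following argument:
* `loss`: BOLT loss function - The loss function to minimize. In this demo version, we only support the `bolt.CategoricalCrossEntropyLoss()` loss function.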
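
For intuition about what categorical cross-entropy measures, here is a small standard-library sketch of the textbook definition applied to a two-class softmax output. It illustrates the math only and says nothing about how `bolt.CategoricalCrossEntropyLoss()` is implemented; the function names and the example scores are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs, true_class):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probs[true_class])


probs = softmax([2.0, 0.0])  # the model favors class 0
loss_if_class_0 = categorical_cross_entropy(probs, 0)  # small: prediction agrees
loss_if_class_1 = categorical_cross_entropy(probs, 1)  # large: prediction disagrees
```

The loss shrinks toward zero as the probability of the correct class approaches 1, which is what training pushes the network toward.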

**Creating training config**

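Create a training config by calling `bolt.graph.TrainConfig.make()`, which accepts the following arguments:
* `learning_rate`: Float - The learning rate for gradient descent. The default value is 0.0001.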
* `epochs`: Int - The number of training epochs (a full cycle through the dataset).
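
To make the roles of `learning_rate` and `epochs` concrete, here is a toy sketch of plain stochastic gradient descent fitting a single weight. It is a generic illustration on made-up data, not BOLT's optimizer, whose details this notebook does not specify; the `fit_slope` helper is hypothetical.

```python
def fit_slope(samples, learning_rate, epochs):
    """Fit y ~ w * x by stochastic gradient descent on squared error."""
    w = 0.0
    for _ in range(epochs):  # one epoch = one full pass over the data
        for x, y in samples:
            grad = 2 * (w * x - y) * x  # derivative of (w*x - y)**2 w.r.t. w
            w -= learning_rate * grad   # step against the gradient
    return w


samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy data generated by y = 2x
w_short = fit_slope(samples, learning_rate=0.01, epochs=1)
w_long = fit_slope(samples, learning_rate=0.01, epochs=20)
```

With more epochs, `w_long` ends up much closer to the true slope of 2 than `w_short`. A larger learning rate takes bigger steps per update but can overshoot.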

**The train() method**

Train the BOLT network by calling the `train()` method, which accepts the following arguments:
* `train_data`: BOLT dataset - The training source dataset in a format returned by `dataset.load_bolt_svm_dataset()`.
* `train_labels`: BOLT dataset - The training target labels in a format returned by `dataset.load_bolt_svm_dataset()`.
* `train_config`: Training config - The training parameters, as returned by `bolt.graph.TrainConfig.make()`.

It then returns a dictionary that contains the loss value and elapsed time for each training epoch.


**Saving a trained model**

Simply call the `save()` method, passing in the location of the save file.

In [None]:
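# Compile the model with the loss function to minimize.
model.compile(loss=bolt.CategoricalCrossEntropyLoss())

# Train for 20 epochs with a learning rate of 0.0001.
train_config = bolt.graph.TrainConfig.make(learning_rate=0.0001, epochs=20)

# `metrics` holds the loss value and elapsed time for each training epoch.
metrics = model.train(
    train_data=train_data, train_labels=train_labels, train_config=train_config
)

# Save the trained model so it can be reloaded without retraining.
model.save(filename="saved_model")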

### **4. Inference**

**Defining predict config**

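Create a predict config by calling `bolt.graph.PredictConfig.make()`, then chain its `with_metrics()` method:
* `with_metrics`: List of strings - The metrics to evaluate on the predictions. In this demo version, we only support the `"categorical_accuracy"` metric.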
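
Categorical accuracy is usually defined as the fraction of samples whose highest-scoring class matches the true label. The sketch below computes it by hand on made-up predictions, assuming that standard definition is what `"categorical_accuracy"` reports; the helper and the data are hypothetical.

```python
def categorical_accuracy(predicted_probs, true_labels):
    """Fraction of samples whose highest-probability class equals the label."""
    if len(predicted_probs) != len(true_labels):
        raise ValueError("need exactly one label per prediction")
    correct = 0
    for probs, label in zip(predicted_probs, true_labels):
        # argmax: index of the largest probability
        predicted_class = max(range(len(probs)), key=probs.__getitem__)
        if predicted_class == label:
            correct += 1
    return correct / len(true_labels)


preds = [[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]]  # made-up outputs
labels = [0, 1, 1, 1]
print(categorical_accuracy(preds, labels))  # 0.75
```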

**The predict() method**

You can do inference by calling the `predict()` method, which accepts the following arguments:
* `test_data`: BOLT dataset - The test dataset in a format returned by `dataset.load_bolt_svm_dataset()`.
* `test_labels`: BOLT dataset - The test labels in a format returned by `dataset.load_bolt_svm_dataset()`.
* `predict_config`: Predict config - The predict config that defines which metrics to compute.

It then returns the metric results:
* `metric_results`: Dictionary - A dictionary mapping each metric name passed to `with_metrics()` to a list of values for that metric.

**Loading a saved model**

To load a saved model, call the `bolt.graph.Model.load()` method. We commented it out by default so you can just continue from the previous cell, but you can always uncomment it so you don't have to retrain the model the next time you visit this notebook!

In [None]:
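# Uncomment to load the saved model instead of retraining.
# Assumption: load() takes the same `filename` argument as save().
# model = bolt.graph.Model.load(filename="saved_model")

# Evaluate categorical accuracy on the test set.
predict_config = bolt.graph.PredictConfig.make().with_metrics(["categorical_accuracy"])

metrics = model.predict(
    test_data=test_data, test_labels=test_labels, predict_config=predict_config
)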

### **Congratulations! You just mastered BOLT.**
If you face any issue running this notebook, please reach out to us by posting about it on [GitHub Issues](https://github.com/ThirdAILabs/Demos/issues).