# Bolt TextClassifier Demo

The Bolt TextClassifier is an easy way to quickly develop, train, and test a model that classifies and categorizes text data.

In [1]:
from thirdai import bolt
import datasets
import random
import os

In [4]:
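# Define helper functions
def write_dataset_to_csv(dataset, filename, return_labels=False):
    # Map integer intent ids to their human-readable class names.
    label_names = dataset.features["intent"].names

    data = []
    for item in dataset:
        sentence = item["text"]
        label = item["intent"]
        label_name = label_names[label]
        data.append((sentence, label_name))

    # Shuffle the examples before writing them out.
    random.shuffle(data)

    # Write a two-column CSV with a header row; every field is wrapped in quotes.
    # Note: quotes inside a sentence are not escaped here.
    with open(filename, "w") as file:
        file.write('"text","category"\n')
        lines = [
            '"{}","{}"\n'.format(sentence, label_name) for sentence, label_name in data
        ]
        file.writelines(lines)

    # Optionally return the labels in the same (shuffled) order as the file rows.
    if return_labels:
        labels = [x[1] for x in data]
        return labels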


def download_clinc_dataset(train_filename, test_filename):
    # Load the "plus" configuration of the clinc_oos dataset.
    clinc_dataset = datasets.load_dataset("clinc_oos", "plus")
    write_dataset_to_csv(clinc_dataset["train"], train_filename)
    labels = write_dataset_to_csv(clinc_dataset["test"], test_filename, return_labels=True)

    # Return the number of classes and the test labels in the order they were written.
    return (clinc_dataset["train"].features["intent"].num_classes, labels)



def compute_accuracy(test_labels, pred_file):
    # Each line of the prediction file holds one predicted class name.
    # Strip only the newline so a final line without one is still read correctly.
    with open(pred_file) as pred:
        predictions = [line.rstrip("\n") for line in pred]

    assert len(predictions) == len(test_labels)
    correct = 0
    for prediction, answer in zip(predictions, test_labels):
        if prediction == answer:
            correct += 1

    return correct / len(test_labels)

# Download the Dataset

In [6]:
TRAIN_FILE = "./clinc_train.csv"
TEST_FILE = "./clinc_test.csv"
PREDICTION_FILE = "./clinc_predictions.txt"

(n_classes, test_labels) = download_clinc_dataset(TRAIN_FILE, TEST_FILE)

Reusing dataset clinc_oos (/Users/nmeisburger/.cache/huggingface/datasets/clinc_oos/plus/1.0.0/abcc41d382f8137f039adc747af44714941e8196e845dfbdd8ae7a7e020e6ba1)


# Defining the TextClassifier model

## Arguments
- `model_size`: Controls how big the model is. Options are "small", "medium", "large", or a target model size in gigabytes, e.g. "4 Gb" or "4Gb".
- `n_classes`: How many output classes are in the dataset.
- `input_dim`: The dimension of the input, i.e. the range that the pairgrams are hashed into. This is an optional parameter that defaults to 100,000.
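
To make the role of `input_dim` more concrete, the sketch below hashes every pair of words in a sentence into the range `[0, input_dim)`. It only illustrates the general pairgram idea: the function name `pairgram_indices`, the whitespace tokenization, and the CRC32 hash are assumptions made for this example, not the hashing that Bolt uses internally.

```python
import zlib
from itertools import combinations_with_replacement


def pairgram_indices(sentence, input_dim=100_000):
    """Hypothetical helper: map each word pair in `sentence` to an index in [0, input_dim)."""
    words = sentence.lower().split()
    indices = set()
    # Pairs of a word with itself are included so single words contribute too
    # (an assumption of this sketch).
    for first, second in combinations_with_replacement(words, 2):
        key = f"{first} {second}".encode("utf-8")
        indices.add(zlib.crc32(key) % input_dim)
    return sorted(indices)


features = pairgram_indices("the red dog ran up the hill")
print(len(features), features[:5])
```

A smaller `input_dim` makes it more likely that two different pairs collide on the same index, while a larger one spreads them out at the cost of a wider input layer.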

In [7]:
classifier = bolt.TextClassifier(model_size="small", n_classes=n_classes)

FullyConnected: dim=670, load_factor=0.2, act_func=ReLU, sampling: {hashes_per_table=3, num_tables=102, range_pow=9, reservoir_size=4}
FullyConnected: dim=151, load_factor=1, act_func=Softmax
Initialized Network in 0 seconds


# Training the TextClassifier

## Arguments
- `train_file`: The dataset to train on. The expected format is a CSV file with two comma-separated columns named "text" and "category". The "text" column contains the sentence and the "category" column contains its label. Example:
    
    ```csv
    "text","category"
    "the red dog ran up the hill","action"
    ```
    
- `epochs`: An optional parameter that sets the number of epochs to train for. The default value is 1. With 1 epoch the classifier uses a streaming dataset; with more than 1 epoch it loads the dataset into memory and trains for the specified number of epochs.
- `learning_rate`: An optional parameter that controls the learning rate. The default value is 0.001.
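
Sentences that contain double quotes or commas can break a hand-formatted CSV line. As a minimal sketch, the two-column file can be written with Python's standard `csv` module, which quotes and escapes such characters. The helper name `write_rows` and the sample row are hypothetical, and whether the TextClassifier's parser accepts doubled-quote escaping is an assumption that should be checked before relying on it.

```python
import csv
import io


def write_rows(rows, file):
    """Hypothetical helper: write (text, category) pairs as a fully quoted two-column CSV."""
    writer = csv.writer(file, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["text", "category"])
    writer.writerows(rows)


# An in-memory buffer keeps the example self-contained; a real file works the same way.
buffer = io.StringIO()
write_rows([('he said "hello", then left', "greeting")], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```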

In [8]:
classifier.train(train_file=TRAIN_FILE, epochs=7, learning_rate=0.01)


Epoch 1:
Processed 60 training batches in 2 seconds

Epoch 2:
Processed 60 training batches in 2 seconds

Epoch 3:
Processed 60 training batches in 2 seconds

Epoch 4:
Processed 60 training batches in 2 seconds

Epoch 5:
Processed 60 training batches in 2 seconds

Epoch 6:
Processed 60 training batches in 2 seconds

Epoch 7:
Processed 60 training batches in 2 seconds


# Testing the TextClassifier

## Arguments
- `test_file`: The dataset to evaluate. The expected format is the same as `train_file` in the `train` method.
- `output_file`: An optional parameter. If it is specified, the name of each predicted class is written to this file as the data is processed, one class name per line.

In [9]:
classifier.predict(test_file=TEST_FILE, output_file=PREDICTION_FILE)

Accuracy: 0.815273 (4484/5500)


## Run evaluation on predictions
Here we recompute the accuracy from the prediction file to demonstrate how the output file can be used.

In [10]:
compute_accuracy(test_labels, PREDICTION_FILE)

0.8152727272727273
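
The same prediction file can also be broken down by class. The sketch below is one possible way to do that, assuming the one-prediction-per-line format described above; the helper name `per_class_accuracy` and the tiny sample data are made up for illustration and are not part of Bolt.

```python
import os
import tempfile
from collections import Counter


def per_class_accuracy(test_labels, pred_file):
    """Hypothetical helper: accuracy for each true label, from a one-prediction-per-line file."""
    with open(pred_file) as pred:
        predictions = [line.rstrip("\n") for line in pred]
    assert len(predictions) == len(test_labels)

    correct = Counter()
    total = Counter()
    for prediction, answer in zip(predictions, test_labels):
        total[answer] += 1
        if prediction == answer:
            correct[answer] += 1
    return {label: correct[label] / total[label] for label in total}


# Tiny made-up example so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("weather\nweather\ntimer\n")
sample_scores = per_class_accuracy(["weather", "timer", "timer"], tmp.name)
os.remove(tmp.name)
print(sample_scores)
```

In the notebook, calling `per_class_accuracy(test_labels, PREDICTION_FILE)` before the cleanup cell would give the breakdown for the CLINC test set.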

In [None]:
# Cleanup files
os.remove(TRAIN_FILE)
os.remove(TEST_FILE)
os.remove(PREDICTION_FILE)