# Bolt TextClassifier Demo

The Bolt `TextClassifier` provides a quick way to build, train, and test a model that classifies text into categories. This demo trains it on the CLINC intent dataset (`clinc_oos`, `plus` configuration) and evaluates its predictions.

# Step 0: Setup 
### Optional: Install thirdai, datasets (if not already on system)

In [None]:
!pip3 install thirdai
!pip3 install datasets

### Import libraries

In [None]:
from thirdai import bolt
import datasets
import random
import os

### Define Helper Functions

In [None]:
def write_dataset_to_csv(dataset, filename, return_labels=False):
    label_names = dataset.features["intent"].names

    data = []
    for item in dataset:
        sentence = item["text"]
        label = item["intent"]
        label_name = label_names[label]
        data.append((sentence, label_name))

    # The TextClassifier does not shuffle its input, so shuffle the rows here.
    random.shuffle(data)

    with open(filename, "w") as file:
        file.write('"text","category"\n')
        lines = [f'"{sentence}","{label_name}"\n' for sentence, label_name in data]
        file.writelines(lines)

    if return_labels:
        labels = [x[1] for x in data]
        return labels


def download_clinc_dataset(train_filename, test_filename):
    clinc_dataset = datasets.load_dataset("clinc_oos", "plus")
    write_dataset_to_csv(clinc_dataset["train"], train_filename)
    labels = write_dataset_to_csv(clinc_dataset["test"], test_filename, return_labels=True)

    return (clinc_dataset["train"].features["intent"].num_classes, labels)



def compute_accuracy(test_labels, pred_file):
    # Each line of pred_file holds one predicted class name.
    with open(pred_file) as pred:
        predictions = pred.readlines()

    assert len(predictions) == len(test_labels)

    correct = 0
    for prediction, answer in zip(predictions, test_labels):
        # Strip the line terminator, which the final line may lack.
        if prediction.rstrip("\n") == answer:
            correct += 1

    return correct / len(test_labels)
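
The `write_dataset_to_csv` helper above builds each row by hand, so a sentence containing a double quote would produce a malformed row. A minimal sketch of a writer based on Python's standard `csv` module, which escapes embedded quotes by doubling them, could look like the following. The name `write_rows_to_csv` is hypothetical, and whether the TextClassifier's CSV reader accepts doubled quotes is an assumption that has not been verified here.

```python
import csv


def write_rows_to_csv(rows, filename):
    """Write (sentence, label_name) pairs as a fully quoted two-column CSV."""
    with open(filename, "w", newline="") as file:
        # QUOTE_ALL matches the hand-written format: every field is quoted.
        # Embedded double quotes are escaped by doubling them.
        writer = csv.writer(file, quoting=csv.QUOTE_ALL, lineterminator="\n")
        writer.writerow(["text", "category"])
        writer.writerows(rows)
```

For input without embedded quotes this produces the same file layout as the hand-written version.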

### Download the Dataset

In [None]:
TRAIN_FILE = "./clinc_train.csv"
TEST_FILE = "./clinc_test.csv"
PREDICTION_FILE = "./clinc_predictions.txt"

(n_classes, test_labels) = download_clinc_dataset(TRAIN_FILE, TEST_FILE)
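
Before training, it can be useful to sanity-check the generated files. The sketch below reads a labeled CSV and reports its header, row count, and how many rows each category has. The helper name `summarize_labeled_csv` is hypothetical, and it assumes the category is the last column of every row.

```python
import csv
from collections import Counter


def summarize_labeled_csv(path):
    """Return the header, row count, and per-category counts of a labeled CSV."""
    with open(path, newline="") as file:
        reader = csv.reader(file)
        header = next(reader)
        # Assumption: the category label is the last column of each row.
        category_counts = Counter(row[-1] for row in reader if row)
    return header, sum(category_counts.values()), category_counts
```

Calling `summarize_labeled_csv(TRAIN_FILE)` returns a counter whose length can be compared against `n_classes`.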

# Step 1: Defining the TextClassifier model

### Arguments
- `model_size`: Controls the size of the model. Options are `"small"`, `"medium"`, `"large"`, or a target model size in gigabytes, e.g. `"4 Gb"` or `"4Gb"`.
- `n_classes`: The number of output classes in the dataset.

In [None]:
classifier = bolt.TextClassifier(model_size="2Gb", n_classes=n_classes)

# Step 2: Training the TextClassifier

### Arguments
- `train_file`: The dataset to train on. The expected format is a comma-separated CSV file with two columns named "text" and "category". The "text" column contains the sentence and the "category" column contains its label. The TextClassifier does not shuffle the dataset, so shuffle it before calling `train` if necessary. Example:
    ```
    "text","category"
    "the red dog ran up the hill","action"
    ```
    
- `epochs`: The number of epochs to train for. With 1 epoch the classifier uses a streaming dataset; otherwise it loads the dataset into memory and trains for the specified number of epochs.
- `learning_rate`: The learning rate used for training.
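
Since the classifier leaves shuffling to the caller, a training file that was not shuffled when it was written needs a shuffling pass first. The following is a minimal sketch using only the standard library. The helper name `shuffle_csv_rows` is hypothetical, and it loads the whole file into memory, so it suits datasets that fit in RAM.

```python
import csv
import random


def shuffle_csv_rows(in_path, out_path, seed=None):
    """Copy a CSV file, keeping the header first and shuffling the data rows."""
    with open(in_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = list(reader)

    # A dedicated Random instance keeps the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(rows)

    with open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

The CSV written by `write_dataset_to_csv` in this demo is already shuffled, so this step only matters for files prepared some other way.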

In [None]:
classifier.train(train_file=TRAIN_FILE, epochs=7, learning_rate=0.01)

# Step 3: Testing the Classifier
### Arguments
- `test_file`: The test dataset to run predictions on. The expected format is the same as for `train_file` in the `train` method.
- `output_file`: Optional. If specified, the name of each predicted class is written to this file as the data is processed, one class name per line.

In [None]:
classifier.predict(test_file=TEST_FILE, output_file=PREDICTION_FILE)

# Alternatively, if no output file is desired
# classifier.predict(test_file=TEST_FILE)

## Running Evaluations on the Predictions
Here we simply recompute the accuracy from the predictions file to demonstrate how the output file can be used.

In [None]:
compute_accuracy(test_labels, PREDICTION_FILE)
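
The same output file also supports finer-grained evaluation. As one possible sketch, the helper below breaks accuracy down per class, assuming (as described for `output_file`) that the file holds one predicted class name per line in the same order as the test labels. The name `per_class_accuracy` is hypothetical.

```python
from collections import Counter


def per_class_accuracy(test_labels, pred_file):
    """Return a dict mapping each true label to the fraction predicted correctly."""
    with open(pred_file) as pred:
        predictions = [line.rstrip("\n") for line in pred]

    if len(predictions) != len(test_labels):
        raise ValueError("number of predictions does not match number of labels")

    # Count how often each label occurs and how often it is predicted correctly.
    totals = Counter(test_labels)
    correct = Counter(
        answer
        for prediction, answer in zip(predictions, test_labels)
        if prediction == answer
    )
    return {label: correct[label] / totals[label] for label in totals}
```

Sorting the result of `per_class_accuracy(test_labels, PREDICTION_FILE)` by value shows which intents the model handles worst.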

In [None]:
# Cleanup files
os.remove(TRAIN_FILE)
os.remove(TEST_FILE)
os.remove(PREDICTION_FILE)