# Week 7a: Classifying user intents

This week you will learn how to train a text classifier that will allow you to build chatbots that can recognise and respond to user intents.

To do this you will be training a text classification algorithm [using the library SetFit](https://github.com/huggingface/setfit), which can be used to train powerful text classifiers with extremely small amounts of data. In this example you will be [using a dataset with three examples per class](class-datasets/basic_intents_train.csv) (a tiny amount of data!). **Because this can be trained with so little data, it is a perfect tool for you to adapt to building chatbots for your own projects!**

The code in this notebook is adapted from: https://hackernoon.com/mastering-few-shot-learning-with-setfit-for-text-classification

Before you get started though, let's just make sure that this notebook is set up to run using the `nlp` conda environment that you created at the start of term.

To set this notebook to the right environment, click the **Select kernel** button in the top right corner of this notebook, then select **Python Environments...** and then select the environment `nlp`.

To double check you have done this correctly, hit the run cell button (▶) on the cell below:

In [None]:
import os
print(os.environ['CONDA_DEFAULT_ENV'])

Now you can import the libraries that you will be using for the activity:

In [3]:
import os
from datasets import load_dataset
from sklearn.preprocessing import LabelEncoder
from setfit import SetFitModel, SetFitTrainer
from sentence_transformers.losses import CosineSimilarityLoss

### Load dataset 

This code loads in two datasets, the **training** dataset and the **testing** dataset. 

[The training dataset](class-datasets/basic_intents_train.csv) is what is used to iteratively improve the classifier model. In effect, this is the data that the model 'practices' on and learns from through trial and error.

[The testing dataset](class-datasets/basic_intents_test.csv) is what we use to evaluate the performance of the classifier after training. It is very important when training a model that you do not train on your testing data. The test data needs to remain unseen by the model, so that you get a reliable accuracy score for how the model performs on new data.

In [None]:
# Load dataset
dataset = load_dataset('csv', data_files={
    "train": 'class-datasets/basic_intents_train.csv',
    "test": 'class-datasets/basic_intents_test.csv'
})

# Encode labels as integers (fit the encoder on the training labels only)
le = LabelEncoder()
intent_dataset_train = le.fit_transform(dataset["train"]['label'])
dataset["train"] = dataset["train"].remove_columns("label").add_column("label", intent_dataset_train).cast(dataset["train"].features)

# Reuse the fitted encoder so the test labels get the same integer mapping
intent_dataset_test = le.transform(dataset["test"]['label'])
dataset["test"] = dataset["test"].remove_columns("label").add_column("label", intent_dataset_test).cast(dataset["test"].features)
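To make the idea behind label encoding concrete, here is a minimal pure-Python sketch: a mapping from label to integer is learned from the sorted unique training labels, and that same mapping is then reused for the test labels. Refitting on the test labels instead could give a different mapping whenever the test set is missing a class. The label strings and the helpers `fit_label_map` and `encode` are made up for illustration; they are not part of scikit-learn or the class dataset.

```python
def fit_label_map(train_labels):
    """Learn a label -> integer mapping from the training labels only."""
    # Sorting the unique labels gives every class a stable index.
    return {label: index for index, label in enumerate(sorted(set(train_labels)))}


def encode(labels, label_map):
    """Encode labels with an already-learned mapping (KeyError on unseen labels)."""
    return [label_map[label] for label in labels]


# Hypothetical labels, for illustration only
train_labels = ["greeting", "farewell", "help", "greeting"]
test_labels = ["help", "greeting"]

label_map = fit_label_map(train_labels)
train_encoded = encode(train_labels, label_map)
test_encoded = encode(test_labels, label_map)

print(label_map)
print(train_encoded)
print(test_encoded)
```

Because the test labels are encoded with the mapping learned from the training labels, the same class always gets the same number in both datasets.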

### Setup and train classifier

This code creates the classifier object which is called `model` because it is a *statistical model* of the training data. 

The object `trainer` is the code that trains the model by iterating over the data and updating the *model parameters* (aka weights) in the classifier. We can configure the training process by passing in different training *hyperparameters* such as `loss_class`, `batch_size`, `num_iterations` and `num_epochs`.

The line of code `trainer.train()` is where the training happens. Depending on the size of your dataset this can take a long time, but with training datasets as small as the one we have here it should not take too long. After training, the model is evaluated on the test set with `trainer.evaluate()` and then the weights of the trained model are saved, so that you can use them in the chatbot [week-7c-intent-based-chatbot.py](week-7c-intent-based-chatbot.py).

Don't worry if you don't understand everything that is going on here at the moment. What happens under the hood when machine learning models like this are trained is covered in depth next term in **AI for Media**.

In [None]:
# Initialize model and trainer
model_id = "sentence-transformers/all-mpnet-base-v2"
model = SetFitModel.from_pretrained(model_id)

trainer = SetFitTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=64,
    num_iterations=20,
    num_epochs=2,
    column_mapping={"text": "text", "label": "label"}
)

# Train the model
trainer.train()

# Evaluate the model
evaluation_results = trainer.evaluate()
print("Evaluation Results:", evaluation_results)

# Save the trained model weights so that the chatbot can load them later
os.makedirs('ckpt/', exist_ok=True)

trainer.model.save_pretrained("ckpt/")
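As a rough illustration of what an accuracy score reports, the sketch below computes the fraction of predictions that match the true labels. The label values are invented for the example, and `simple_accuracy` is a hypothetical helper written for this sketch, not the function that SetFit calls internally.

```python
def simple_accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that equal the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("true_labels and predicted_labels must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


# Invented values for illustration: 4 of the 5 predictions are correct
true_labels = [0, 1, 2, 8, 5]
predicted_labels = [0, 1, 2, 8, 4]

accuracy = simple_accuracy(true_labels, predicted_labels)
print(accuracy)
```

With a very small test set, a single wrong prediction moves this number a lot, so treat the score as a rough guide rather than a precise measurement.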

### Class label mapping

The classifier will only learn to predict an integer value for each class. For us to understand which class each number represents, you need to manually keep track of them in some kind of data structure:

> Note: If you adapt this code for your projects and add or change these classes, then you will need to change this data structure as well.

In [5]:
class_label_map = {
    0: "Greeting",
    1: "Farewell",
    2: "Positive Confirmation",
    3: "Negative Confirmation",
    4: "Small Talk",
    5: "Time Enquiry",
    6: "Help",
    7: "Escalation to Human",
    8: "Request Joke"
}
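If you do change the classes, one way to keep the map in step with them is to build it from an ordered list of intent names, so that the indices are generated rather than typed by hand. This is only a sketch: the list order copies the map above and is assumed to match the label numbers in your dataset, and `intent_names`, `generated_label_map` and `label_to_index` are names introduced here for illustration.

```python
# Intent names in class-index order (assumed to match the labels in the dataset)
intent_names = [
    "Greeting",
    "Farewell",
    "Positive Confirmation",
    "Negative Confirmation",
    "Small Talk",
    "Time Enquiry",
    "Help",
    "Escalation to Human",
    "Request Joke",
]

# Index -> name, generated from the list position
generated_label_map = dict(enumerate(intent_names))

# Name -> index, handy when you want to look up a class by its intent name
label_to_index = {name: index for index, name in generated_label_map.items()}

print(generated_label_map[8])
print(label_to_index["Help"])
```

Adding a new intent then only means appending its name to the end of the list.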

### Test classifier on new input

Now you can test the model using `model.predict()` on a new string input. 

> Note: This function returns a [PyTorch Tensor](https://pytorch.org/docs/stable/generated/torch.tensor.html) (we will be covering exactly what this is in AI for Media next term). This contains an integer that represents the most likely class index. You can simply cast this to a Python `int` to get the value in a variable that is easier to work with in Python.

Try changing the value of `input_text` to get predictions for some different classes. Try using a different expression or wording from what is given in [the training data](class-datasets/basic_intents_train.csv):

In [20]:
model = SetFitModel.from_pretrained("ckpt/", local_files_only=True)

input_text = "I need a good laugh"
output = model.predict(input_text)
output_label = int(output)

print(f"Predicted output class: {output_label}, which is intent: '{class_label_map[output_label]}'")

Predicted output class: 8, which is intent: 'Request Joke'


The function above (`model.predict()`) will only tell you the index of the class with the highest confidence probability. Often this is all that you would care about, but sometimes you might want to see the confidence scores for both the top class and the other classes.

To get that information you can call the function `model.predict_proba()`, which will give an array of floats containing the confidence probabilities for every class in the training dataset.

> Note: This function also returns a [PyTorch Tensor](https://pytorch.org/docs/stable/generated/torch.tensor.html). To convert this into a nice list of Python `float` values, you can call the function [.tolist()](https://pytorch.org/docs/stable/generated/torch.Tensor.tolist.html).

This output list only contains the confidence probabilities for each class; the index of each class is given by its position in the list.

If you want to find the maximum value and its class index, you need to use the function `max` to get the maximum value in the list and `list.index` to then find the index of that maximum value:

In [21]:
input_text = "I need a good laugh"
output_probs = model.predict_proba(input_text)
output_probs = output_probs.tolist()
print(f'list of class probabilities: {output_probs}')

max_conf = max(output_probs)
max_class = output_probs.index(max_conf)

print(f"Predicted class: {max_class} with confidence {max_conf:.4f} which is intent '{class_label_map[max_class]}'")

list of class probabilities: [0.06979068512279166, 0.07797967909695576, 0.07026228583253973, 0.06973780979281664, 0.08603421715207447, 0.07036901482089156, 0.07528835358312502, 0.07648611531165186, 0.4040518392871534]
Predicted class: 8 with confidence 0.4041 which is intent 'Request Joke'
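If you want more than the single best class, you could rank all of the classes by confidence. The sketch below does this with plain Python on a list like the one printed above (values rounded here). `example_probs` and `rank_classes` are illustrative names, not part of SetFit.

```python
def rank_classes(probabilities, top_k=3):
    """Return (class_index, probability) pairs, highest probability first."""
    ranked = sorted(enumerate(probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Rounded copy of the probabilities printed above
example_probs = [0.0698, 0.0780, 0.0703, 0.0697, 0.0860, 0.0704, 0.0753, 0.0765, 0.4041]

for class_index, probability in rank_classes(example_probs):
    print(f"class {class_index}: {probability:.4f}")
```

Looking at the runner-up classes can help you spot intents that the model tends to confuse with one another.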


## Tasks

**Task 1:** Run all the cells in this notebook to [train](#setup-and-train-classifier) and [test](#test-classifier-on-new-input) your classifier.

**Task 2:** [week-7c-intent-based-chatbot.py](week-7c-intent-based-chatbot.py) has functions already made that respond to all of the intent classes in the dataset. Write some code in `generate_response` that takes the string `processed_input`, feeds it into the classification model and gets the predicted class for the input. Based on the predicted class (0-8), respond by calling the relevant function for that intent.

> Tip: There is a comment above each function describing what intent each function responds to. Use this information and the [class label map](#class-label-mapping) to determine which function needs to be called depending on the class prediction.

That's it! Now you can move on to the tasks in [Week-7b-Sentiment-analysis.ipynb](Week-7b-Sentiment-analysis.ipynb). After that feel free to come back to the bonus tasks in here, or you could start adapting the dataset in this classifier to be relevant for your own projects.

### Bonus tasks

**Task A:** Try adding a new intent to this dataset, such as asking today's date, asking the chatbot what its name is, or what its favourite colour is. Add 3 examples of the user intent to the dataset with a new class label, e.g. `9`. After you have trained the model, test it. If it works, add a function to the `IntentChatbot` that responds to the intent and then add this new functionality to the chatbot.

**Task B:** In `generate_response` use the function `predict_proba` to get the confidences of the predictions from the classifier. If the max prediction is below a certain threshold, e.g. 0.3, then instead of responding based on the max prediction, give a response saying something such as 'I do not understand'. Try out different values for this threshold to see if you can get the best tradeoff between accurately responding to the intents in the dataset and recognising when an input is not in one of those categories.

**Task C:** Create a new dataset to adapt this classifier to intents (or other categories you may want to classify based on user inputs) that are relevant to your project. Train the model, test it, and then try importing it into the chatbot code for your projects.
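> Hint for Task B: the thresholding idea can be tried out on a plain list of probabilities before you wire it into the chatbot. The following is a sketch under assumptions: `pick_intent` is a hypothetical helper, the 0.3 threshold is just the starting value suggested in the task, and the probability lists are invented.

```python
def pick_intent(probabilities, threshold=0.3):
    """Return the most likely class index, or None if the confidence is too low."""
    max_conf = max(probabilities)
    if max_conf < threshold:
        # The caller can reply with something like 'I do not understand'
        return None
    return probabilities.index(max_conf)


# Invented probability lists for illustration
confident = [0.07, 0.08, 0.07, 0.07, 0.09, 0.07, 0.07, 0.08, 0.40]
uncertain = [0.12, 0.11, 0.10, 0.11, 0.12, 0.11, 0.11, 0.11, 0.11]

print(pick_intent(confident))
print(pick_intent(uncertain))
```

Returning `None` keeps the decision about what to say separate from the decision about whether the model is confident enough, which makes it easier to experiment with different thresholds.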