# This notebook implements multiclass text classification with the [DistilBERT model](https://arxiv.org/pdf/1910.01108.pdf) using the [ktrain library](https://pypi.org/project/ktrain/)

1.   Multiclass text classification: the problem of assigning each text 
instance (document, query, ticket) to one of three or more classes.
<br> <br> Example: classifying queries raised by bank customers into one of the relevant 
categories, such as loan related, credit card related, or debit card related. 
<br><br> Note: Text classification is an example of supervised machine learning.


2. DistilBERT model: a deep-learning-based, general-purpose language model, distilled from BERT and pretrained on a large corpus of text data. 

3.   ktrain: a lightweight wrapper for the deep learning library TensorFlow Keras (and other libraries) that helps build, train, and deploy neural networks and other machine learning models.


# Import required libraries

In [None]:
import ktrain 
from ktrain import text
import tensorflow as tf
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
import dotenv
import os
from pathlib import Path

# To get reproducible results, set the random seeds.

In [None]:
def reset_random_seeds(seed=2):
    os.environ['PYTHONHASHSEED']=str(seed)
    tf.random.set_seed(seed)
    np.random.seed(seed)
    
reset_random_seeds() 
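
As a quick sanity check on the idea of seeding, the sketch below shows that reseeding NumPy's global generator reproduces the same random draws. The helper name `draw_sample` is hypothetical and not part of this notebook, and the check covers NumPy only: TensorFlow operations, especially on a GPU, may still vary slightly between runs.

```python
import numpy as np


def draw_sample(seed: int, size: int = 5) -> np.ndarray:
    """Seed NumPy's global generator, then draw `size` random integers."""
    np.random.seed(seed)
    return np.random.randint(0, 100, size=size)


# Two draws with the same seed should be identical.
first = draw_sample(seed=2)
second = draw_sample(seed=2)

print(np.array_equal(first, second))
```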

Follow the link below for instructions on enabling the GPU:
https://learnopencv.com/how-to-use-opencv-dnn-module-with-nvidia-gpu-on-windows/

# If a GPU is available, the code below enables memory growth on the first GPU so that TensorFlow allocates GPU memory as needed. If no GPU is available, the model is trained on the CPU.

In [None]:
physical_devices = tf.config.list_physical_devices('GPU') 
if len(physical_devices)>0:
  tf.config.experimental.set_memory_growth(physical_devices[0], True)

# Read the training data

In [None]:
dotenv.load_dotenv() 
train_str = "train_data.csv"
data_dir = Path(os.getenv('csv_input_dir'))  
save_dir = Path(os.getenv('output_model_dir'))

train_data = pd.read_csv(data_dir/train_str,index_col =0)
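
The cell above assumes that the environment variables `csv_input_dir` and `output_model_dir` are defined, for example in a `.env` file. If one is missing, `os.getenv` returns `None` and `Path(None)` fails with a less helpful error. A minimal sketch of a more explicit check follows; the helper `require_dir` and the variable `demo_input_dir` are hypothetical names used only for illustration.

```python
import os
from pathlib import Path


def require_dir(var_name: str) -> Path:
    """Return the directory named by an environment variable, or fail clearly."""
    value = os.getenv(var_name)
    if not value:
        raise KeyError(f"Environment variable {var_name!r} is not set")
    return Path(value)


# Hypothetical variable and value, set here only so the example runs on its own.
os.environ["demo_input_dir"] = "data"
demo_dir = require_dir("demo_input_dir")

print(demo_dir / "train_data.csv")
```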

In [None]:
train_data.head()

In [None]:
Column_names = set(train_data["Label"])

In [None]:
columns = sorted(Column_names)  # sorted so the names line up with scikit-learn's label order
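
Before training, it can help to look at how many rows each label has, since unbalanced classes affect how the metrics later in this notebook should be read. The sketch below uses a small invented DataFrame with the same `Content` and `Label` column names as this notebook; the rows are made up for illustration.

```python
import pandas as pd

# Invented rows; only the column names match the notebook's data.
toy = pd.DataFrame({
    "Content": [
        "How do I apply for a home loan?",
        "What is the interest rate on a car loan?",
        "Can I repay my loan early?",
        "My credit card was declined.",
        "How do I raise my credit card limit?",
        "I lost my debit card.",
    ],
    "Label": ["loan", "loan", "loan", "credit card", "credit card", "debit card"],
})

counts = toy["Label"].value_counts()         # rows per label
class_names = sorted(toy["Label"].unique())  # stable, sorted list of class names

print(counts)
print(class_names)
```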

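Shuffle the data so that rows of each label are spread across both the train and validation sets. Shuffling makes the class proportions in the two sets similar on average, but it does not guarantee that they are equal.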

In [None]:
# Shuffle all rows and discard the old index in one step
train_data = train_data.sample(frac=1).reset_index(drop=True)
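
If the shuffle itself should be repeatable independently of the global NumPy seed, `DataFrame.sample` accepts a `random_state` argument. The following sketch, on an invented toy DataFrame, shows that two shuffles with the same `random_state` give the same row order; the value `2` is an arbitrary choice.

```python
import pandas as pd

# Invented toy data with the notebook's column names.
toy = pd.DataFrame({"Content": list("abcdef"), "Label": [0, 0, 1, 1, 2, 2]})

# Same random_state, same permutation of the rows.
shuffled_a = toy.sample(frac=1, random_state=2).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=2).reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))
```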

The input data (training data) is divided into train and test sets (the test set can also be called validation data) in an 80:20 ratio.
This is done to evaluate the performance of the trained model.

In [None]:
train_size = int(len(train_data) * .8)
print ("Train size: %d" % train_size)
print ("Test size: %d" % (len(train_data) - train_size))

# Train features
description_train = train_data['Content'][:train_size]

# Train labels
labels_train = train_data['Label'][:train_size]

# Test features
description_test = train_data['Content'][train_size:]

# Test labels
labels_test = train_data['Label'][train_size:]
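
The slice-based split above relies on the earlier shuffle to mix the classes. One alternative, shown here only as a sketch and not what this notebook does, is scikit-learn's `train_test_split` with the `stratify` argument, which keeps the class proportions the same in both parts. The toy data and the `random_state` value are invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented toy data: 10 "loan", 5 "credit card", 5 "debit card" rows.
toy = pd.DataFrame({
    "Content": [f"text {i}" for i in range(20)],
    "Label": ["loan"] * 10 + ["credit card"] * 5 + ["debit card"] * 5,
})

# 80:20 split that preserves the label proportions in both parts.
train_part, test_part = train_test_split(
    toy, test_size=0.2, stratify=toy["Label"], random_state=2
)

print(train_part["Label"].value_counts())
print(test_part["Label"].value_counts())
```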

Convert the above Series objects to 1-D arrays so that they can be passed to the ktrain preprocessor.

In [None]:
x_train = description_train.values
y_train = labels_train.values
x_test = description_test.values
y_test = labels_test.values

# Load and preprocess the text data from arrays with the text module of ktrain.

ktrain's `texts_from_array` loads texts and their associated labels from arrays.

In [None]:
trn, val, preproc = text.texts_from_array(x_train=x_train, y_train=y_train,
                                          x_test=x_test, y_test=y_test,
                                          ngram_range=3, 
                                          maxlen=200, 
                                          preprocess_mode='distilbert',
                                          max_features=35000)
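
The `maxlen=200` setting caps the sequence length, so longer texts are truncated. A rough way to sanity-check such a cap is to look at the length distribution of the texts. The sketch below counts whitespace-separated words on invented examples; this is only an approximation, because DistilBERT works on subword tokens, so real token counts are usually higher than word counts.

```python
import numpy as np

# Invented texts; the last one is deliberately long.
texts = [
    "I want to apply for a home loan",
    "My credit card was declined",
    "word " * 300,
]

# Approximate length of each text in whitespace-separated words.
word_counts = np.array([len(t.split()) for t in texts])

# Share of texts that fit within a cap of 200 words.
share_within = float(np.mean(word_counts <= 200))

print(word_counts)
print(f"share within 200 words: {share_within:.2f}")
```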

Download the pre-trained DistilBERT model for transfer learning.

In [None]:
model = text.text_classifier('distilbert', train_data=trn, preproc=preproc)

In [None]:
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=2)

Print the summary of the model.

In [None]:
model.summary()

In [None]:
learner.fit_onecycle(3e-5, 8)  # learning rate = 3e-5, epochs = 8
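
`fit_onecycle` trains with a one-cycle learning-rate policy, in which the learning rate rises to a maximum and then falls again. The sketch below is a simplified triangular schedule meant only to illustrate that shape. It is not ktrain's actual implementation, and the function name `triangular_lr` and the parameter `start_fraction` are hypothetical.

```python
def triangular_lr(step: int, total_steps: int, max_lr: float,
                  start_fraction: float = 0.1) -> float:
    """Linear warm-up to max_lr at the midpoint, then linear decay back down."""
    min_lr = max_lr * start_fraction
    half = total_steps / 2
    if step <= half:
        progress = step / half                   # rising phase: 0 -> 1
    else:
        progress = (total_steps - step) / half   # falling phase: 1 -> 0
    return min_lr + (max_lr - min_lr) * progress


# Learning rate at each of 9 evenly spaced points of an 8-step cycle.
schedule = [triangular_lr(s, total_steps=8, max_lr=3e-5) for s in range(9)]

for step, lr in enumerate(schedule):
    print(step, f"{lr:.2e}")
```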

# Test model on test data and calculate evaluation metrics.

In [None]:
predictor = ktrain.get_predictor(learner.model, preproc)

In [None]:
pred_result = predictor.predict(x_test)

# A multiclass classification model is generally evaluated on the following metrics:

1. Accuracy :<br>
The base metric used for model evaluation is often Accuracy, describing the number of correct predictions over all predictions.

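2. Precision:<br>
Precision is a measure of how many of the positive predictions made are correct. In the multiclass case it is computed per class: of all instances predicted as a given class, the fraction that actually belong to it. 

3. Recall:<br>
Recall is a measure of how many of the positive cases the classifier correctly predicted, over all the positive cases in the data. Per class: of all instances that actually belong to a given class, the fraction predicted as that class.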

4. F1-Score:<br>
F1-Score is a measure combining both precision and recall. It is generally described as the harmonic mean of the two. Harmonic mean is just another way to calculate an “average” of values, generally described as more suitable for ratios (such as precision and recall) than the traditional arithmetic mean.
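
The definitions above can be worked through by hand on a small example. The sketch below treats one class as "positive" and all others as "negative" (one-vs-rest) and computes precision, recall, and F1-score for it in plain Python. The labels and predictions are invented, and the helper `per_class_metrics` is a hypothetical name.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Invented true labels and predictions.
y_true = ["loan", "loan", "loan", "credit card", "credit card", "debit card"]
y_pred = ["loan", "loan", "credit card", "credit card", "loan", "debit card"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = per_class_metrics(y_true, y_pred, positive="loan")

print(f"accuracy={accuracy:.3f}")
print(f"loan: precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```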

Accuracy is not a good measure when the classes are unbalanced. <br><br>
Here we are interested in high values of both accuracy and F1-score, since the F1-score summarises precision and recall together.  <br><br>
The code below prints all the required metrics.

In [None]:
print(classification_report(
  y_test, 
  pred_result, 
  labels=columns,        # fix the label order explicitly
  target_names=columns,  # display names in the same order as `labels`
  zero_division=0
))
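
To see why accuracy alone can mislead on unbalanced classes, consider a classifier that always predicts the majority class. The sketch below uses invented labels (nine `loan` rows and one `credit card` row) and compares accuracy with the macro-averaged F1-score, which gives every class equal weight.

```python
from sklearn.metrics import accuracy_score, f1_score

# Invented, heavily unbalanced labels.
y_true = ["loan"] * 9 + ["credit card"]
# A trivial "classifier" that always predicts the majority class.
y_pred = ["loan"] * 10

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy={accuracy:.2f}, macro F1={macro_f1:.2f}")
```

Accuracy looks high here even though the minority class is never predicted, while the macro F1-score is pulled down by that class's score of zero.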

# Confusion Matrix

The numbers of correct and incorrect predictions are summarized as counts and broken down by class. <br><br>

The confusion matrix shows the ways in which the classification model
is confused when it makes predictions: each row corresponds to a true class and each column to a predicted class.
<br><br>
The code below prints the confusion matrix for the classification model.

In [None]:
# Rows are true labels, columns are predicted labels; `labels` fixes the order of both.
c_m = pd.DataFrame(confusion_matrix(y_test, pred_result, labels=columns),
                   index=columns, columns=columns)
c_m
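
As a small worked example of how to read such a table, the sketch below builds a confusion matrix for two invented classes. Passing `labels` explicitly fixes the row and column order, so the DataFrame headers are guaranteed to match the counts.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Invented labels and predictions for two classes.
labels = ["card", "loan"]
y_true = ["loan", "loan", "card", "card", "loan"]
y_pred = ["loan", "card", "card", "card", "loan"]

# Rows follow the true labels, columns follow the predicted labels.
matrix = confusion_matrix(y_true, y_pred, labels=labels)
table = pd.DataFrame(matrix, index=labels, columns=labels)
table.index.name = "true"
table.columns.name = "predicted"

print(table)
```

In this toy table, the cell in row `loan` and column `card` counts the `loan` instances that were wrongly predicted as `card`.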

Save the trained model.

In [None]:
predictor.save(save_dir/"model_distilbert")