# This notebook implements multiclass text classification with the [DistilBERT model](https://arxiv.org/pdf/1910.01108.pdf) using the [ktrain library](https://pypi.org/project/ktrain/)

1.   Multiclass text classification: the problem of classifying text 
instances (documents, queries, tickets) into one of three or more classes.
<br> <br> Example: classifying queries raised by bank customers into a relevant 
category such as loan related, credit card related or debit card related. 
<br><br> Note: Text classification is an example of supervised machine learning.


2. DistilBERT model: a general-purpose, deep-learning-based language model that is pretrained on a large corpus of text data. It is a smaller, distilled version of BERT.

3.   ktrain: a lightweight wrapper for the deep learning library TensorFlow Keras (and other libraries) that helps build, train, and deploy neural networks and other machine learning models.


# Import required libraries

In [2]:
import os
from pathlib import Path

import dotenv
import ktrain
import numpy as np
import pandas as pd
import tensorflow as tf
from ktrain import text
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight
# Use the Keras bundled with TensorFlow so callbacks match the model ktrain builds
from tensorflow.keras.callbacks import EarlyStopping

# Show every row when displaying DataFrames
pd.set_option("display.max_rows", None)

# Set random seeds to get reproducible results

In [3]:
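import random

def reset_random_seeds(seed=2):
    # Note: PYTHONHASHSEED only affects hashing if it is set before the interpreter starts
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    tf.random.set_seed(seed)
    np.random.seed(seed)

reset_random_seeds()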

Follow the link below to enable GPU support:
https://learnopencv.com/how-to-use-opencv-dnn-module-with-nvidia-gpu-on-windows/

# If a GPU is available, the code below enables memory growth on it so that TensorFlow allocates GPU memory only as needed. Without a GPU, the model is trained on the CPU.

In [4]:
physical_devices = tf.config.list_physical_devices('GPU') 
if len(physical_devices)>0:
  tf.config.experimental.set_memory_growth(physical_devices[0], True)

# Read the training data

In [5]:
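dotenv.load_dotenv()  # load csv_input_dir and output_model_dir from a .env file
train_str = "train_data.csv"
data_dir = Path(os.getenv('csv_input_dir'))
save_dir = Path(os.getenv('output_model_dir'))

# The first CSV column is used as the DataFrame index
train_data = pd.read_csv(data_dir / train_str, index_col=0)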

In [6]:
Column_names = set(train_data["Label"])

In [7]:
columns = list(Column_names)

In [8]:
columns

['Field Services - Flint MI',
 'Data Analysis and Reporting',
 'Service Desk - Enterprise Tier - 3',
 'PA Harrisburg - IT',
 'Field Services - El Paso TX - Fed',
 'Cloud Computing Services - Windows',
 'Service Desk - Enterprise',
 'Telephony Provisioning',
 'Premier Services - Enterprise',
 'LSA - Federal',
 'LSA - Sandy UT',
 'SharePoint',
 'Field Services - FEMA Call Centers',
 'Field Services - Memphis',
 'Cherwell Admins',
 'ISO - Physical Security',
 'Field Services - Boston',
 'LSA - Hattiesburg MS',
 'Federal CE Solutions',
 'Field Services - Chicago, IL Wacker',
 'MECM Admin',
 'CA FSS Application Support',
 'Lansweeper Group',
 'Field Services - Chicago Wabash ILCSE',
 'Field Services - Winchester KY - CCO - Fed',
 'ISO - Threat and Vulnerability',
 'Messaging',
 'Field Services - Milwaukee',
 'Field Services - AidVantage',
 'LSA - Brownsville TX',
 'Database',
 'ERP - Back Office',
 'Field Services - Lynn Haven FL\xa0- CCO -Fed',
 'Field Services - Brownsville - Fed - Boca C

Shuffle the data so that each label is represented in roughly the same proportion in the train and validation sets.

In [9]:
# drop=True discards the old index instead of adding it back as a column
train_data = train_data.sample(frac=1).reset_index(drop=True)

The input data (training data) is divided into train and test sets (the test set can also be called validation data) in the ratio 80:20.
This is done to evaluate the performance of the trained model.

In [11]:
train_size = int(len(train_data) * .8)
print ("Train size: %d" % train_size)
print ("Test size: %d" % (len(train_data) - train_size))

# Train features
description_train = train_data['Content'][:train_size]

# Train labels
labels_train = train_data['Label'][:train_size]

# Test features
description_test = train_data['Content'][train_size:]

# Test labels
labels_test = train_data['Label'][train_size:]

Train size: 101729
Test size: 25433
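The reason for shuffling before a positional 80:20 split can be seen on a small example. The sketch below uses a hypothetical toy dataset (the texts and the three bank-style labels are made up for illustration) that is sorted by label, and compares the label counts of the split with and without shuffling.

```python
import random
from collections import Counter

# Hypothetical toy dataset, sorted by label the way an exported table might be.
rows = [("text %d" % i, label)
        for label in ("loan", "credit card", "debit card")
        for i in range(10)]

def positional_split(data, train_fraction=0.8):
    """Split a list into a leading train part and a trailing test part."""
    train_size = int(len(data) * train_fraction)
    return data[:train_size], data[train_size:]

# Without shuffling, the trailing 20% holds only the final label.
_, test_sorted = positional_split(rows)
print(Counter(label for _, label in test_sorted))

# With shuffling (fixed seed for repeatability), labels can appear in both parts.
shuffled = rows[:]
random.Random(2).shuffle(shuffled)
train_shuffled, test_shuffled = positional_split(shuffled)
print(Counter(label for _, label in train_shuffled))
print(Counter(label for _, label in test_shuffled))
```

Shuffling only makes the label proportions similar on average. If exact proportions matter, a stratified split such as `sklearn.model_selection.train_test_split` with the `stratify` argument is an alternative worth considering.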


Convert the above Series objects to 1-D arrays so that they can be passed to the ktrain preprocessor.

In [12]:
x_train = description_train.values
y_train = labels_train.values
x_test = description_test.values
y_test = labels_test.values

In [13]:
y_train

array(['Field Services - FEMA Call Centers', 'CCO Training Applications',
       'Service Desk - Enterprise Tier - 2', ..., 'ERP - Back Office',
       'LSA - London KY', 'Field Services - Baltimore Fed - ERate'],
      dtype=object)

# Load and preprocess text data from arrays with the text module of ktrain

The ktrain function `texts_from_array` loads texts and their associated labels from arrays and preprocesses them for the chosen model.

In [14]:
trn, val, preproc = text.texts_from_array(x_train=x_train, y_train=y_train,
                                          x_test=x_test, y_test=y_test,
                                          ngram_range=3, 
                                          maxlen=200, 
                                          preprocess_mode='distilbert',
                                          max_features=35000)

preprocessing train...
language: en
train sequence lengths:
	mean : 13
	95percentile : 25
	99percentile : 33


Is Multi-Label? False
preprocessing test...
language: en
test sequence lengths:
	mean : 13
	95percentile : 25
	99percentile : 33


task: text classification


Download the pre-trained DistilBERT model for transfer learning.

In [15]:
model = text.text_classifier('distilbert', train_data=trn, preproc=preproc)

Is Multi-Label? False
maxlen is 200
done.


In [16]:
train_labels = train_data.Label

In [17]:
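# "balanced" weights are n_samples / (n_classes * class_count), so rare labels get larger weights
class_weights = compute_class_weight(class_weight="balanced", classes=np.unique(train_labels), y=train_labels)

# Map each class index (labels in sorted order) to its weight, the format Keras expects
weights = dict(enumerate(class_weights))

print(weights)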



{0: 0.5365485232067511, 1: 0.5365485232067511, 2: 0.5365485232067511, 3: 0.5365485232067511, 4: 14.371835443037975, 5: 14.119697979125029, 6: 1.7572549886683986, 7: 2.579560207724765, 8: 2.424165014488333, 9: 0.5365485232067511, 10: 4.258321612751993, 11: 0.5365485232067511, 12: 0.5365485232067511, 13: 12.194284618335251, 14: 1.0047725153684477, 15: 7.521708269253519, 16: 2.9480688088283027, 17: 12.774964838255977, 18: 8.654008438818565, 19: 12.012280370300397, 20: 2.2171426578791364, 21: 14.904125644631973, 22: 0.5365485232067511, 23: 1.884830877775472, 24: 13.876254910519425, 25: 0.5365485232067511, 26: 12.381888997078871, 27: 0.9053124688527858, 28: 0.6808991411253186, 29: 0.9627066803948883, 30: 0.5365485232067511, 31: 2.5631298879303395, 32: 0.5365485232067511, 33: 1.3617982822506371, 34: 0.5365485232067511, 35: 0.9457377024795847, 36: 0.7806234576237892, 37: 5.437991789257612, 38: 15.185335562455219, 39: 1.9774515597300408, 40: 3.7088607594936707, 41: 0.5365485232067511, 42: 1.08
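The weights printed above follow the formula that scikit-learn documents for `class_weight="balanced"`: `n_samples / (n_classes * count_of_class)`. The sketch below reproduces that formula in plain Python on a hypothetical, tiny label list (the helper name `balanced_class_weights` and the counts are made up for illustration), to show why rare labels receive larger weights.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class_index: weight} with classes indexed in sorted order."""
    counts = Counter(labels)
    classes = sorted(counts)
    n_samples = len(labels)
    n_classes = len(classes)
    return {
        index: n_samples / (n_classes * counts[name])
        for index, name in enumerate(classes)
    }

# Hypothetical imbalanced labels: 6 "Database", 3 "Messaging", 1 "SharePoint".
toy_labels = ["Database"] * 6 + ["Messaging"] * 3 + ["SharePoint"]
toy_weights = balanced_class_weights(toy_labels)
print(toy_weights)
```

The index-to-weight mapping assumes that the model's class indices follow the same sorted label order. It is worth checking this against the class order that ktrain reports (for example via `preproc.get_classes()`, if available in your ktrain version).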

In [None]:
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)

# Optionally, estimate a suitable learning rate first
#learner.lr_find(show_plot=True,max_epochs=5)

# Train for up to 32 epochs with the chosen learning rate;
# stop early if val_loss has not improved by at least 0.0001 for 5 consecutive epochs
learner.autofit(2e-5, 32, class_weight=weights, callbacks=[EarlyStopping(monitor='val_loss', patience=5, min_delta=0.0001)])



begin training using triangular learning rate policy with max lr of 2e-05...
Epoch 1/32
  126/16955 [..............................] - ETA: 11:20:10 - loss: 4.9693 - accuracy: 0.0053
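The training call above passes an `EarlyStopping` callback with `patience=5` and `min_delta=0.0001`. The sketch below is a simplified simulation of that stopping rule, not the Keras implementation: the function name `early_stopping_epoch` and the validation losses are hypothetical, and the rule is reduced to "stop once the loss has failed to improve by more than `min_delta` for `patience` epochs in a row".

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0001):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Counts as an improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses, not results from this notebook.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.70, 0.705, 0.70, 0.69]
print(early_stopping_epoch(losses))
```

In this made-up run the best loss of 0.7 is reached at epoch 3, and the following five epochs do not beat it by more than `min_delta`, so training would stop at epoch 8 of the 32 requested.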

Print the summary of the model.

In [17]:
model.summary()

Model: "tf_distil_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 distilbert (TFDistilBertMai  multiple                 66362880  
 nLayer)                                                         
                                                                 
 pre_classifier (Dense)      multiple                  590592    
                                                                 
 classifier (Dense)          multiple                  121502    
                                                                 
 dropout_19 (Dropout)        multiple                  0         
                                                                 
Total params: 67,074,974
Trainable params: 67,074,974
Non-trainable params: 0
_________________________________________________________________


# Test the model on the test data and calculate evaluation metrics

In [None]:
predictor = ktrain.get_predictor(learner.model, preproc)

In [None]:
pred_result = predictor.predict(x_test)

# A multiclass classification model is generally evaluated on the following metrics:

1. Accuracy:<br>
Accuracy is often the base metric for model evaluation: the number of correct predictions divided by the total number of predictions.

2. Precision:<br>
Precision is the fraction of positive predictions that are correct. In the multiclass case it is computed per class, treating that class as the positive one.

3. Recall:<br>
Recall is the fraction of all positive cases in the data that the classifier predicted correctly.

4. F1-Score:<br>
F1-Score combines precision and recall as their harmonic mean. The harmonic mean is another way to calculate an “average” of values and is generally considered more suitable for ratios (such as precision and recall) than the arithmetic mean.

Accuracy is not a good measure when the data has unbalanced classes. <br><br>
Here we want both accuracy and F1-score to be high, since the F1-score summarises precision and recall together.  <br><br>
The code below prints all the required metrics.

In [None]:
print(classification_report(
  y_test, 
  pred_result, 
  labels=columns,  # report rows follow the same order as target_names
  target_names=columns, 
  zero_division=0
))
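To make the metric definitions above concrete, the sketch below computes accuracy and per-class precision, recall and F1 by hand. The labels, predictions and the helper `per_class_scores` are hypothetical; returning 0.0 when a denominator is zero mirrors the `zero_division=0` setting used in the report.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for one label, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical labels and predictions.
true_labels = ["loan", "loan", "loan", "card", "card", "other"]
pred_labels = ["loan", "loan", "card", "card", "loan", "other"]

accuracy = sum(t == p for t, p in zip(true_labels, pred_labels)) / len(true_labels)
print("accuracy:", accuracy)
for name in sorted(set(true_labels)):
    print(name, per_class_scores(true_labels, pred_labels, name))
```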

# Confusion Matrix

The number of correct and incorrect predictions is summarized with count values and broken down by class. <br><br>

The confusion matrix shows the ways in which the classification model
is confused when it makes predictions.
<br><br>
The code below prints the confusion matrix for the classification model.

In [None]:
# Rows are true labels and columns are predicted labels, both in the order of `columns`
c_m = pd.DataFrame(confusion_matrix(y_test, pred_result, labels=columns), index=columns, columns=columns)
c_m
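How the counts in a confusion matrix come about can be sketched in plain Python. The example below is illustrative only: the labels, predictions and the helper `build_confusion_matrix` are hypothetical. It follows the same layout as scikit-learn, with rows for true labels and columns for predicted labels.

```python
def build_confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, both in `labels` order."""
    position = {name: i for i, name in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[position[t]][position[p]] += 1
    return matrix

# Hypothetical labels and predictions.
true_labels = ["loan", "loan", "loan", "card", "card", "other"]
pred_labels = ["loan", "loan", "card", "card", "loan", "other"]
label_order = ["card", "loan", "other"]

toy_matrix = build_confusion_matrix(true_labels, pred_labels, label_order)
for name, row in zip(label_order, toy_matrix):
    print(name, row)
```

The diagonal holds the correct predictions for each class; every off-diagonal cell counts one specific kind of confusion, such as a true "loan" query predicted as "card".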

Save the trained model.

In [None]:
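# Pass the path as a string in case the installed ktrain version does not accept Path objects
predictor.save(str(save_dir / "model_distilbert"))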