# Capstone: IT Ticket Classification [SEPT SUN GRP 4B]

![1_yK5G9nHmOD-wrJSRSvPEpw.jpeg](attachment:1_yK5G9nHmOD-wrJSRSvPEpw.jpeg)

# Aim: Automatic Ticket Assignment [Part /6]


Build a classifier that assigns tickets to groups by analyzing their text. Routing incidents to the right functional group helps an organization reduce resolution time and frees staff to focus on more productive tasks.

## Pre-Processing, Data Visualization and EDA

- Exploring the given Data files
- Understanding the structure of data
- Missing points in data
- Finding inconsistencies in the data
- Visualizing different patterns
- Visualizing different text features
- Dealing with missing values
- Text preprocessing
- Creating word vocabulary from the corpus of report text data
- Creating tokens as required
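
The last two steps of the list above (building a vocabulary and creating tokens) can be sketched roughly as follows. This is a minimal illustration on a made-up three-ticket corpus, assuming plain whitespace tokenization; the actual preprocessing used for this project may differ (for example lemmatization or stop-word removal), and the names `corpus`, `tokens_per_ticket` and `vocab` are hypothetical.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the cleaned ticket descriptions
corpus = [
    "login issue verify user detail",
    "outlook meeting not appear in calendar",
    "unable to login to outlook",
]

# Tokenize each ticket by whitespace (an assumption; real pipelines often do more)
tokens_per_ticket = [ticket.lower().split() for ticket in corpus]

# Count how often each token occurs across the whole corpus
counts = Counter(token for tokens in tokens_per_ticket for token in tokens)

# Assign integer ids by descending frequency; id 0 is left free for padding
vocab = {token: idx for idx, (token, _) in enumerate(counts.most_common(), start=1)}

# Encode a ticket as a list of vocabulary ids
encoded_first_ticket = [vocab[token] for token in tokens_per_ticket[0]]
print(len(vocab), encoded_first_ticket)
```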

## Model Building

- Building a BERT model with ktrain that classifies the tickets into assignment groups
- BERT is a method of pre-training language representations: a general-purpose "language understanding" model is trained on a large text corpus (like Wikipedia) and then reused for downstream NLP tasks (like question answering). According to its authors, BERT outperforms previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP. Unsupervised means that BERT was trained using only a plain text corpus, which matters because an enormous amount of plain text is publicly available on the web in many languages. Pre-trained representations can be either context-free or contextual, and contextual representations can be unidirectional or bidirectional. Context-free models such as word2vec or GloVe generate a single "word embedding" for each word in the vocabulary, so *bank* has the same representation in *bank deposit* and *river bank*. Contextual models instead generate a representation of each word based on the other words in the sentence. Source: https://github.com/google-research/bert/blob/master/README.md
- Save the model for reload and prediction without having to run the model again.
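
The difference between context-free and contextual representations described above can be illustrated with a toy example. This is only a sketch: the two-dimensional vectors are invented, and the `contextual` function simply averages in the neighbouring words, whereas BERT uses stacked self-attention layers. All names here are hypothetical.

```python
# Toy 2-dimensional "embeddings"; the numbers are made up for illustration only
static_embeddings = {
    "bank": [1.0, 1.0],
    "deposit": [0.0, 2.0],
    "river": [2.0, 0.0],
}

def context_free(word, sentence):
    """Look the word up in a fixed table; the sentence is ignored."""
    return static_embeddings[word]

def contextual(word, sentence):
    """Blend the word's vector with the mean vector of the other words.

    This only mimics the idea of context sensitivity; it is not how BERT works.
    """
    neighbours = [static_embeddings[w] for w in sentence if w != word]
    mean = [sum(dim) / len(neighbours) for dim in zip(*neighbours)]
    return [(a + b) / 2 for a, b in zip(static_embeddings[word], mean)]

s1 = ["bank", "deposit"]
s2 = ["river", "bank"]

# Context-free: "bank" gets the same vector in both phrases
same_static = context_free("bank", s1) == context_free("bank", s2)
# Contextual: the vector for "bank" depends on its neighbours
same_contextual = contextual("bank", s1) == contextual("bank", s2)
print(same_static, same_contextual)
```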

# Import Libraries

In [None]:
!pip install ktrain

Collecting ktrain
  Downloading https://files.pythonhosted.org/packages/1d/a3/9bb8c202f8f20171fb54a1991796ea95e4af0020afc13890f4ba4092dede/ktrain-0.20.1.tar.gz (25.3MB)
Collecting tensorflow==2.1.0
  Downloading https://files.pythonhosted.org/packages/85/d4/c0cd1057b331bc38b65478302114194bd8e1b9c2bbc06e300935c0e93d90/tensorflow-2.1.0-cp36-cp36m-manylinux2010_x86_64.whl (421.8MB)
Collecting keras_bert>=0.81.0
  Downloading https://files.pythonhosted.org/packages/e2/7f/95fabd29f4502924fa3f09ff6538c5a7d290dfef2c2fe076d3d1a16e08f0/keras-bert-0.86.0.tar.gz
Collecting langdetect
  Downloading https://files.pythonhosted.org/packages/56/a3/8407c1e62d5980188b4acc45ef3d94b933d14a2ebc9ef3505f22cf772570/langdetect-1.0.8.tar.gz (981kB)
Collecting cchardet
  Downloading https://files.pythonhosted

In [None]:
#Bert Model using KTrain
import ktrain
from ktrain import text
import pandas as pd 
import numpy as np 
import tensorflow as tf
import sys

In [None]:
# Report library and interpreter versions as a single tuple
np.__version__, pd.__version__, tf.__version__, sys.version_info

('1.18.5', '1.0.5', '2.1.0', sys.version_info(major=3, minor=6, micro=9, releaselevel='final', serial=0))

# Data Loading

The data file used for the BERT model (built with ktrain) contains the cleaned ticket records for all 74 classes, with no resampling applied.


In [None]:

from google.colab import drive
drive.mount('/content/drive')

project_path = '/content/drive/My Drive/Colab Notebooks/'
file_name ='itsupportdatacleaned_2.csv'



Mounted at /content/drive


In [None]:
data=pd.read_csv(project_path+file_name,encoding=sys.getfilesystemencoding()) 
data.head(2).T

Unnamed: 0,0,1
Short description,login issue,outlook
Description,login issue verify user detail employee manage...,outlook hello team meeting skype meeting etc a...
Caller,spxjnwir pjlcoqds,hmjdrvpb komuaywn
Assignment group,GRP_0,GRP_0
New_Assignment_Groups,GRP_0,GRP_0
Text_length,206,194
Dominant_Topic,1,1
Topic_Perc_Contrib,0.9912,0.6753
Keywords,"issue, tool, unable, user, error, work, access...","issue, tool, unable, user, error, work, access..."
Text,"['login', 'issue', 'verify', 'user', 'detail',...","['team', 'meeting', 'appear', 'calendar', 'adv..."


In [None]:
# Drop rows that have a missing value in the column at position 1
data.dropna(subset=[data.columns[1]], inplace=True)

# Transformer Model (BERT)

## STEP 1:  Load and Preprocess the Data
Preprocess the data using the `texts_from_df` function (since the data resides in a DataFrame).

In [None]:
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_df(
    data,
    'Description',                       # column containing the ticket text
    label_columns=['Assignment group'],  # column containing the target group
    maxlen=250,                          # maximum sequence length per ticket
    max_features=10000,
    preprocess_mode='bert',
    val_pct=0.1,                         # hold out 10% of the rows for validation
    ngram_range=3)

downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


Is Multi-Label? False
preprocessing test...
language: en
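
The effect of `val_pct=0.1` can be pictured as a random hold-out split. The sketch below is not ktrain's implementation; `holdout_split` is a hypothetical helper, and the 7,800-row dataset size is an assumption chosen because a 10% hold-out then leaves the 7,020 training samples reported in the training log further down.

```python
import random

def holdout_split(rows, val_pct=0.1, seed=42):
    """Shuffle the rows and hold out a val_pct fraction for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_pct)
    return shuffled[n_val:], shuffled[:n_val]

# Assumed dataset size (hypothetical); row ids stand in for tickets
rows = list(range(7800))
train_rows, val_rows = holdout_split(rows, val_pct=0.1)
print(len(train_rows), len(val_rows))
```

Because the split is random rather than stratified in this sketch, rare groups may end up with very few or no validation examples, which is worth keeping in mind when reading per-class metrics.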


## STEP 2:  Load the BERT Model and Instantiate a Learner object

In [None]:
model = text.text_classifier('bert', train_data=(x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train), batch_size=10)

Is Multi-Label? False
maxlen is 250
done.


## STEP 3: Train the Model

We train using one of the three learning rates recommended in the BERT paper: *5e-5*, *3e-5*, or *2e-5*.
Alternatively, the ktrain Learning Rate Finder can be used to find a good learning rate by invoking `learner.lr_find()` and `learner.lr_plot()`, prior to training.
The `learner.fit_onecycle` method employs a [1cycle learning rate policy](https://arxiv.org/pdf/1803.09820.pdf).
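
To give a feel for the 1cycle idea, here is a simplified triangular schedule: the learning rate ramps up linearly to the maximum and then back down. This is a sketch under stated assumptions, not ktrain's exact schedule; `one_cycle_lr` and the `div_factor` of 10 are hypothetical, and the step count assumes 7,020 training samples, batch size 10 and 10 epochs as used in this notebook.

```python
def one_cycle_lr(step, total_steps, max_lr, div_factor=10.0):
    """Triangular schedule: ramp from max_lr/div_factor up to max_lr, then back down."""
    base_lr = max_lr / div_factor
    half = total_steps / 2
    if step <= half:
        frac = step / half                  # rising phase: 0 -> 1
    else:
        frac = (total_steps - step) / half  # falling phase: 1 -> 0
    return base_lr + frac * (max_lr - base_lr)

# 7020 samples / batch size 10 = 702 steps per epoch, times 10 epochs
total_steps = 702 * 10
schedule = [one_cycle_lr(s, total_steps, max_lr=2e-5) for s in range(total_steps + 1)]
print(schedule[0], max(schedule), schedule[-1])
```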

In [None]:
learner.fit_onecycle(2e-5, 10)



begin training using onecycle policy with max lr of 2e-05...
Train on 7020 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f887e780438>

We can use the `learner.validate` method to test our model against the validation set. BERT achieves **92%** training accuracy and **66%** validation accuracy, which is noticeably higher than the **57%** accuracy achieved by the hyperparameter-tuned Random Forest model on unsampled data.

In [None]:
y_test_back = [i.argmax() for i in y_test]
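
The cell above recovers integer class indices from the label matrix. A minimal sketch of that idea, assuming each label row is one-hot encoded (as the `argmax` call suggests) and using a shortened, hypothetical class list:

```python
# Hypothetical, shortened class list; the real one has 74 entries
classes = ["GRP_0", "GRP_1", "GRP_10"]

# One-hot rows: exactly one position per row is 1.0
y_onehot = [
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]

def argmax(row):
    """Return the position of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)

y_index = [argmax(row) for row in y_onehot]   # integer class indices
y_names = [classes[i] for i in y_index]       # readable group names
print(y_index, y_names)
```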

In [None]:
# Print per-class metrics on the validation set; classes are listed by integer index
learner.validate(val_data=(x_test, y_test))

              precision    recall  f1-score   support

           0       0.75      0.92      0.83       329
           1       0.00      0.00      0.00         3
           2       0.56      0.56      0.56         9
           3       0.50      0.50      0.50         2
           4       0.57      0.55      0.56        31
           5       0.69      0.48      0.56        23
           6       0.31      0.45      0.37        11
           7       0.00      0.00      0.00         5
           8       0.57      0.50      0.53         8
           9       1.00      1.00      1.00         1
          10       0.83      0.50      0.62        10
          11       0.56      0.23      0.32        22
          12       0.36      0.25      0.29        20
          13       0.00      0.00      0.00         6
          14       0.50      0.40      0.44         5
          15       0.50      1.00      0.67         2
          16       0.33      1.00      0.50         2
          17       0.86    

  _warn_prf(average, modifier, msg_start, len(result))


array([[303,   0,   0, ...,   0,   0,   1],
       [  0,   0,   0, ...,   0,   0,   0],
       [  2,   0,   5, ...,   0,   1,   0],
       ...,
       [  1,   0,   0, ...,   0,   0,   0],
       [  0,   1,   0, ...,   0,  49,   4],
       [  0,   0,   0, ...,   0,  15,   4]])
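
The report and the confusion matrix are two views of the same counts. As a rough sketch, assuming rows are true classes and columns are predicted classes (the scikit-learn convention), recall and precision can be read off a matrix as below. The 3x3 matrix is invented for illustration and `recall_precision` is a hypothetical helper; the final line uses numbers from the output above, where class 0 has 303 correct predictions out of a support of 329.

```python
def recall_precision(cm, k):
    """Recall and precision for class k from a confusion matrix (rows = true)."""
    tp = cm[k][k]
    actual = sum(cm[k])                     # row sum: samples whose true class is k
    predicted = sum(row[k] for row in cm)   # column sum: samples predicted as k
    recall = tp / actual if actual else 0.0
    precision = tp / predicted if predicted else 0.0
    return recall, precision

# Invented toy matrix for three classes
toy_cm = [
    [8, 1, 1],
    [2, 5, 0],
    [0, 0, 3],
]

accuracy = sum(toy_cm[i][i] for i in range(3)) / sum(map(sum, toy_cm))
recall_0, precision_0 = recall_precision(toy_cm, 0)
print(accuracy, recall_0, precision_0)

# Recall of class 0 from the notebook output: 303 correct out of 329
print(round(303 / 329, 2))
```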

## How to Use Our Trained BERT Model

We can call the `ktrain.get_predictor` function to obtain a Predictor object capable of making predictions on new raw data.

In [None]:
predictor = ktrain.get_predictor(learner.model, preproc)

In [None]:
predictor.get_classes()

['GRP_0',
 'GRP_1',
 'GRP_10',
 'GRP_11',
 'GRP_12',
 'GRP_13',
 'GRP_14',
 'GRP_15',
 'GRP_16',
 'GRP_17',
 'GRP_18',
 'GRP_19',
 'GRP_2',
 'GRP_20',
 'GRP_21',
 'GRP_22',
 'GRP_23',
 'GRP_24',
 'GRP_25',
 'GRP_26',
 'GRP_27',
 'GRP_28',
 'GRP_29',
 'GRP_3',
 'GRP_30',
 'GRP_31',
 'GRP_32',
 'GRP_33',
 'GRP_34',
 'GRP_35',
 'GRP_36',
 'GRP_37',
 'GRP_38',
 'GRP_39',
 'GRP_4',
 'GRP_40',
 'GRP_41',
 'GRP_42',
 'GRP_43',
 'GRP_44',
 'GRP_45',
 'GRP_46',
 'GRP_47',
 'GRP_48',
 'GRP_49',
 'GRP_5',
 'GRP_50',
 'GRP_51',
 'GRP_52',
 'GRP_53',
 'GRP_54',
 'GRP_55',
 'GRP_56',
 'GRP_57',
 'GRP_58',
 'GRP_59',
 'GRP_6',
 'GRP_60',
 'GRP_61',
 'GRP_62',
 'GRP_63',
 'GRP_64',
 'GRP_65',
 'GRP_66',
 'GRP_67',
 'GRP_68',
 'GRP_69',
 'GRP_7',
 'GRP_70',
 'GRP_71',
 'GRP_72',
 'GRP_73',
 'GRP_8',
 'GRP_9']
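
Note that the class list is in string order, so `GRP_10` comes right after `GRP_1`. The sketch below reproduces that ordering and shows how an integer index could be mapped to a group name. It assumes the integer labels in the classification report correspond to positions in this list, which is not confirmed by the output shown here.

```python
# Rebuild the 74 group names and sort them as strings, as in the list above
classes = sorted(f"GRP_{i}" for i in range(74))

# Look up a group name by index, and an index by group name
index_of = {name: i for i, name in enumerate(classes)}
print(classes[2], classes[12], index_of["GRP_33"])
```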

In [None]:
data['Description'].iloc[1013]

'netzteil oder netzstecker defekt pc wareneingang bitte netzteil oder netzstecker pc evhw wareneingang pr fen und ggf reparieren pc sst sich nur nach bewegen des steckers einschalten'

In [None]:
data['Assignment group'].iloc[1013]

'GRP_33'

In [None]:
# Predict the assignment group for the same ticket
predictor.predict(data['Description'].iloc[1013])

'GRP_33'

The `predictor.save` and `ktrain.load_predictor` methods can be used to save the Predictor object to disk and reload it at a later time to make predictions on new data.

In [None]:
# Save the predictor for later use
predictor.save(project_path + 'my_predictor')

In [None]:
# reload the predictor
reloaded_predictor = ktrain.load_predictor(project_path+'my_predictor')

In [None]:
reloaded_predictor.predict(data['Description'].iloc[1311])

'GRP_3'

In [None]:
print(data['Assignment group'].iloc[1311])

GRP_3
