<a href="https://colab.research.google.com/github/Liunech/bert-pretrained/blob/main/BertPretrained.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is an exercise to use BERT pretrained model for ITSM ticket categorisation.

Commonly, NLP models require a lot of data to train a good model. The amount of data is often estimated in the hundreds of thousands or millions of examples. However, when the data is scarce, one can benefit from the pretrained models. BERT was developed by Google in 2018, and since then, many other pretrained models were released and shared with the public access. One of the most popular python library to work with pretrained models is transformers developed by [Hugging Space](https://huggingface.co/models). Today, it contains thousands of pretrained models. I used two of them as described below.

This code can be run on the cpu. However, it will take a serious amount of time. Instead, I run it in Colab to leverage the free-of-charge GPU. In this case, training takes minutes.

To turn on GPU in Colab, go to Edit -> Notebook settings -> Hardware accelerator. Choose GPU from dropdown menu, press Save.

Check that GPU is found:

In [1]:
import tensorflow as tf

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


In my case, I had already a model based on the sklearn library for ticket classification. My task was to improve the prediction quality of that model. Therefore, I was interested in sticking to sklearn and my solution had to be able to integrate with the old program if needed. Therefore, I found it was the easiest to start with the bert-sklearn wrapper. Down the way, I had to modify the original code of this wrapper to add more pretrained models from the Hugging Space repository which is described below.

In [2]:
import csv
def read_csv(filepath):
    with open(filepath, 'r') as csvfp:
        csvreader = csv.DictReader(csvfp, 
                                   delimiter=',', 
                                   quotechar='"', 
                                   skipinitialspace=True, 
                                   quoting=csv.QUOTE_ALL)
        trainset = [(row['body'], row['category']) for row in csvreader]
    return trainset

In [3]:
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

def _clean_text(text, language="english"):
    text = text.strip().lower()
    stemmer = WordNetLemmatizer()
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join([i for i in nopunc if not i.isdigit()])
    nopunc =  [word.lower() for word in nopunc.split() if word not in stopwords.words(language)]
    lemma = ' '.join(stemmer.lemmatize(word) for word in nopunc).strip()
    return lemma if len(lemma)>0 else None

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [4]:
import random
from collections import Counter

def preprocess(data, language='english', cv=6):

  if language not in nltk.corpus.stopwords.fileids():
        language = 'english'

  random.shuffle(data)
  data_cleaned = [(_clean_text(d[0], language), d[1].strip().lower()) for d in data]

  data_cleaned = [d for d in data_cleaned if d[0] is not None]

  category = [d[1] for d in data_cleaned]
  category_counter = Counter(category)
  too_few =  {cat for cat, cat_count in category_counter.items() if int(cat_count)<cv}
  data_cleaned = [d for d in data_cleaned if d[1] not in too_few]

  return data_cleaned

In [5]:
tickets = read_csv('all_tickets.csv')
split = int(len(tickets)*0.05)
testset = preprocess(tickets[:split])
trainset = preprocess(tickets[split:])
print(len(trainset), len(testset))
# read
# clean
# split

46111 2424


In [43]:
categories = [t[1] for t in trainset]
category_counter = Counter(categories)
print(category_counter)

Counter({'4': 32389, '5': 9090, '6': 2497, '7': 862, '11': 599, '8': 236, '9': 191, '3': 136, '1': 66, '12': 45})


I my case, I have to bring this dataset to the data format that was used originally for the old model. There, the input to the model was only the text describing the tickets. Let's start with the **body** column. It will be interesting to see how the prediction changes if I merge the title into body later.

In [18]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline

def _build_production_pipeline():
    pipeline = Pipeline([
        ('vect', CountVectorizer(strip_accents='unicode', ngram_range=(1,2), max_features=10000)),
        ('tfidf', TfidfTransformer()),
        ('cccv', CalibratedClassifierCV(cv = 5,
                    base_estimator=RidgeClassifier())
        )
    ])
    return pipeline

In [19]:
pipeline = _build_production_pipeline()

In [16]:
train_x = [t[0] for t in trainset]
train_y = [int(t[1]) for t in trainset]
print(len(train_x),len(train_y))
category_counter = Counter(train_y)
print(category_counter)

46111 46111
Counter({4: 32389, 5: 9090, 6: 2497, 7: 862, 11: 599, 8: 236, 9: 191, 3: 136, 1: 66, 12: 45})


In [20]:

pipeline.fit(train_x, train_y)

Pipeline(steps=[('vect',
                 CountVectorizer(max_features=10000, ngram_range=(1, 2),
                                 strip_accents='unicode')),
                ('tfidf', TfidfTransformer()),
                ('cccv',
                 CalibratedClassifierCV(base_estimator=RidgeClassifier(),
                                        cv=5))])

In [21]:
from sklearn.model_selection import cross_val_score
cvs = cross_val_score(pipeline, train_x, train_y, cv=5)
cvs

array([0.85308468, 0.85718933, 0.84981566, 0.85404468, 0.84916504])

To istall bert-sklearn, follow the original istructions:

In [7]:
!git clone -b master https://github.com/charles9n/bert-sklearn

Cloning into 'bert-sklearn'...
remote: Enumerating objects: 259, done.[K
remote: Total 259 (delta 0), reused 0 (delta 0), pack-reused 259[K
Receiving objects: 100% (259/259), 516.15 KiB | 2.90 MiB/s, done.
Resolving deltas: 100% (131/131), done.


In [8]:
cd bert-sklearn

/content/bert-sklearn


In [9]:
pip install .

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing /content/bert-sklearn
[33m  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.[0m
Collecting boto3
  Downloading boto3-1.24.4-py3-none-any.whl (132 kB)
[K     |████████████████████████████████| 132 kB 5.0 MB/s 
Collecting s3transfer<0.7.0,>=0.6.0
  Downloading s3transfer-0.6.0-py3-none-any.whl (79 kB)
[K     |████████████████████████████████| 79 kB 9.4 MB/s 
[?25hCollecting jmespath<2.0.0,>=0.7.1
  Downloading jmespath-1.0.0-py3-none-any.whl (23 kB)
Collecting botocore<1.28.0,>=1.27.4
  Downloading botocore-1.27.4-py3-none-any.whl (8.9 

I will use a public dataset of IT Support tickets taken from [Kaggle](https://www.kaggle.com/code/aniketg11/support-tickets-classification/data)

In [4]:
from bert_sklearn import BertClassifier
model = BertClassifier()
model.fit(df['body'], df['category'])

Building sklearn text classifier...


100%|██████████| 231508/231508 [00:00<00:00, 2748282.51B/s]


Loading bert-base-uncased model...


100%|██████████| 440473133/440473133 [00:07<00:00, 56554297.04B/s]
100%|██████████| 433/433 [00:00<00:00, 356454.10B/s]


Defaulting to linear classifier/regressor
Loading Pytorch checkpoint

train data size: 43695, validation data size: 4854



  cpuset_checked))


Training  :   0%|          | 0/1366 [00:00<?, ?it/s]

	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, *, Number alpha) (Triggered internally at  ../torch/csrc/utils/python_arg_parser.cpp:1055.)
  next_m.mul_(beta1).add_(1 - beta1, grad)


Validating:   0%|          | 0/607 [00:00<?, ?it/s]


Epoch 1, Train loss: 0.6064, Val loss: 0.4327, Val accy: 85.31%



Training  :   0%|          | 0/1366 [00:00<?, ?it/s]

Validating:   0%|          | 0/607 [00:00<?, ?it/s]


Epoch 2, Train loss: 0.3693, Val loss: 0.4093, Val accy: 86.22%



Training  :   0%|          | 0/1366 [00:00<?, ?it/s]

Validating:   0%|          | 0/607 [00:00<?, ?it/s]


Epoch 3, Train loss: 0.2826, Val loss: 0.4265, Val accy: 86.01%



BertClassifier(do_lower_case=True,
               label_list=array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12]))

In [5]:
from bert_sklearn import BertClassifier
model = BertClassifier()
#model.bert_model = "dbmdz/bert-base-italian-xxl-uncased"
model.bert_model = "bert-base-multilingual-cased"
model.bert_model = "bert-base-multilingual-cased"
#model.do_lower_case = True
#model.num_mlp_layers = 1
model.train_batch_size = 8
model.max_seq_length = 384
model.epochs = 5
model.fit(df['description'], df['category'])

ModuleNotFoundError: ignored

In [None]:
testdf = pd.read_csv('test.csv')

In [None]:
tests = model.predict(testdf['description'])

  cpuset_checked))


Predicting:   0%|          | 0/24 [00:00<?, ?it/s]

In [None]:
from sklearn.metrics import classification_report

In [None]:
report = classification_report(tests,testdf['category'])

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [None]:
report

'                                         precision    recall  f1-score   support\n\n                                  Altro       0.71      0.55      0.62        31\n                          Backup change       1.00      0.50      0.67         2\n                          Backup report       1.00      1.00      1.00         3\n                         Backup request       0.75      1.00      0.86         3\n                     Dismissione server       1.00      0.92      0.96        13\n                       File system full       0.78      1.00      0.88         7\n            Full Backup per dismissione       0.00      0.00      0.00         0\n                    Gestione FileSystem       0.76      0.87      0.81        15\n                  Grant accesso crontab       1.00      1.00      1.00         1\n             Installazione agent Qualys       1.00      0.67      0.80         3\n                 Installazione software       1.00      1.00      1.00         2\n     Modifica

In [None]:
import pickle

In [None]:
with open('pipeline.pkl', "wb") as fp:
  pickle.dump(model, fp)