
## Workflow: 
1. Import Data
2. Data preprocessing and downloading BERT
3. Training and validation
4. Saving the model


In [1]:
# install ktrain on Google Colab
!pip3 install ktrain

Collecting ktrain
  Using cached ktrain-0.37.6-py3-none-any.whl
Collecting matplotlib>=3.0.0 (from ktrain)
  Using cached matplotlib-3.7.2-cp311-cp311-win_amd64.whl (7.5 MB)
Collecting fastprogress>=0.1.21 (from ktrain)
  Using cached fastprogress-1.0.3-py3-none-any.whl (12 kB)
Collecting langdetect (from ktrain)
  Using cached langdetect-1.0.9-py3-none-any.whl
Collecting jieba (from ktrain)
  Using cached jieba-0.42.1-py3-none-any.whl
Collecting cchardet (from ktrain)
  Using cached cchardet-2.1.7-cp311-cp311-win_amd64.whl
Collecting chardet (from ktrain)
  Using cached chardet-5.2.0-py3-none-any.whl (199 kB)
Collecting syntok>1.3.3 (from ktrain)
  Using cached syntok-1.4.4-py3-none-any.whl (24 kB)
Collecting tika (from ktrain)
  Using cached tika-2.6.0-py3-none-any.whl
Collecting transformers>=4.17.0 (from ktrain)
  Using cached transformers-4.32.0-py3-none-any.whl (7.5 MB)
Collecting sentencepiece (from ktrain)
  Using cached sentencepiece-0.1.99-cp311-cp311-win_amd64.whl (977 kB)
C


[notice] A new release of pip is available: 23.1.2 -> 23.2.1
[notice] To update, run: C:\Users\SOUVIK\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip


In [1]:
import pandas as pd
import numpy as np

import ktrain
from ktrain import text

  from .autonotebook import tqdm as notebook_tqdm
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


## 1. Import Data

In [4]:
data_train = pd.read_csv('souvikghosh2000/EmoBot/data/data_test.csv', encoding='utf-8')
data_test = pd.read_csv('souvikghosh2000/EmoBot/data/data_train.csv', encoding='utf-8')

X_train = data_train.Text.tolist()
X_test = data_test.Text.tolist()

y_train = data_train.Emotion.tolist()
y_test = data_test.Emotion.tolist()

data = data_train.append(data_test, ignore_index=True)

class_names = ['joy', 'sadness', 'fear', 'anger', 'neutral']

print('size of training set: %s' % (len(data_train['Text'])))
print('size of validation set: %s' % (len(data_test['Text'])))
print(data.Emotion.value_counts())

data.head(10)

FileNotFoundError: [Errno 2] No such file or directory: 'souvikghosh2000/EmoBot/data/data_test.csv'

In [None]:
encoding = {
    'joy': 0,
    'sadness': 1,
    'fear': 2,
    'anger': 3,
    'neutral': 4
}

# Integer values for each class
y_train = [encoding[x] for x in y_train]
y_test = [encoding[x] for x in y_test]

## 2. Data preprocessing

* The text must be preprocessed in a specific way for use with BERT. This is accomplished by setting preprocess_mode to ‘bert’. The BERT model and vocabulary will be automatically downloaded

* BERT can handle a maximum length of 512, but let's use less to reduce memory and improve speed. 

In [6]:
(x_train,  y_train), (x_test, y_test), preproc = text.texts_from_array(x_train=X_train, y_train=y_train,
                                                                       x_test=X_test, y_test=y_test,
                                                                       class_names=class_names,
                                                                       preprocess_mode='bert',
                                                                       maxlen=350, 
                                                                       max_features=35000)

downloading pretrained BERT model (uncased_L-12_H-768_A-12.zip)...
[██████████████████████████████████████████████████]
extracting pretrained BERT model...
done.

cleanup downloaded zip...
done.

preprocessing train...
language: en


preprocessing test...
language: en


## 2. Training and validation


Loading the pretrained BERT for text classification 

In [7]:
model = text.text_classifier('bert', train_data=(x_train, y_train), preproc=preproc)

Is Multi-Label? False
maxlen is 350
done.


Wrap it in a Learner object

In [None]:
learner = ktrain.get_learner(model, train_data=(x_train, y_train), 
                             val_data=(x_test, y_test),
                             batch_size=6)

Train the model. More about tuning learning rates [here](https://github.com/amaiya/ktrain/blob/master/tutorial-02-tuning-learning-rates.ipynb)

In [9]:
learner.fit_onecycle(2e-5, 3)



begin training using onecycle policy with max lr of 2e-05...
Train on 7934 samples, validate on 3393 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7ffa776ace10>

Validation

In [10]:
learner.validate(val_data=(x_test, y_test), class_names=class_names)

              precision    recall  f1-score   support

         joy       0.87      0.85      0.86       707
     sadness       0.84      0.79      0.82       676
        fear       0.86      0.87      0.86       679
       anger       0.81      0.80      0.81       693
     neutral       0.78      0.85      0.81       638

    accuracy                           0.83      3393
   macro avg       0.83      0.83      0.83      3393
weighted avg       0.83      0.83      0.83      3393



array([[598,   8,  15,  13,  73],
       [ 18, 537,  37,  54,  30],
       [ 16,  20, 590,  40,  13],
       [ 19,  49,  35, 557,  33],
       [ 37,  24,  12,  24, 541]])

#### Testing with other inputs

In [11]:
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.get_classes()

['joy', 'sadness', 'fear', 'anger', 'neutral']

In [16]:
import time 

message = 'I just broke up with my boyfriend'

start_time = time.time() 
prediction = predictor.predict(message)

print('predicted: {} ({:.2f})'.format(prediction, (time.time() - start_time)))

predicted: sadness (0.06)


## 4. Saving Bert model


In [None]:
# let's save the predictor for later use
predictor.save("models/bert_model")