**IMPORTANT!** <br>
Set the runtime type to 'GPU' to speed up training. Do not use 'TPU'.


# Import Modules

In [1]:
!pip3 -q install ktrain


In [25]:
# needed for the predictor's explain() method (used in the Prediction section) to work
!pip -q install git+https://github.com/amaiya/eli5@tfkeras_0_10_1



In [2]:
import pandas as pd
import numpy as np
import ktrain
from collections import Counter
from sklearn.datasets import fetch_20newsgroups

# Load Data

In [3]:
# list of available categories : https://scikit-learn.org/stable/datasets/real_world.html#usage
categories = ['sci.med','sci.electronics']

In [4]:
data_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
data_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

In [5]:
X_train, X_test, y_train, y_test = data_train.data, data_test.data, data_train.target, data_test.target

In [6]:
X_train[:3]

['From: fulk@cs.rochester.edu (Mark Fulk)\nSubject: Re: Breech Baby Info Needed\nOrganization: University of Rochester\nLines: 89\n\nIn article <1993Apr5.151818.27409@trentu.ca> xtkmg@trentu.ca (Kate Gregory) writes:\n>In article <1993Apr3.161757.19612@cs.rochester.edu> fulk@cs.rochester.edu (Mark Fulk) writes:\n>>\n>>Another uncommon problem is maternal hemorrhage.  I don\'t remember the\n>>incidence, but it is something like 1 in 1,000 or 10,000 births.  It is hard\n>>to see how you could handle it at home, and you wouldn\'t have very much time.\n>>\n>>thing you might consider is that people\'s risk tradeoffs vary.  I consider\n>>a 1/1,000 risk of loss of a loved one to require considerable effort in\n>>the avoiding.\n>\n>Mark, you seem to be terrified of the birth process\n\nThat\'s ridiculous!\n\n>and unable to\n>believe that women\'s bodies are actually designed to do it.\n\nThey aren\'t designed, they evolved.  And, much as it discomforts us, in\nhumans a trouble-free birth proce

In [7]:
y_train[:3]

array([1, 1, 1])

In [8]:
data_train.target_names

['sci.electronics', 'sci.med']

In [9]:
print(len(X_train))
print(len(X_test))
print(Counter(y_train))

1185
789
Counter({1: 594, 0: 591})
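The counts above are keyed by integer label. A small sketch of how they can be mapped back to class names to check the balance; the numbers are copied from the outputs above, and the variable names are illustrative only.

```python
from collections import Counter

# Values copied from the cell outputs above.
target_names = ['sci.electronics', 'sci.med']
label_counts = Counter({1: 594, 0: 591})

total = sum(label_counts.values())

# Share of each class in the training set, keyed by class name.
class_share = {
    target_names[label]: count / total
    for label, count in sorted(label_counts.items())
}

for name, share in class_share.items():
    print(f"{name}: {share:.1%}")
```

The two classes are close to evenly split, so plain accuracy is a reasonable first metric here.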


# Modeling
List of pre-trained models with descriptions: [link](https://huggingface.co/transformers/pretrained_models.html)

In [10]:
PRETRAINED_MODEL = 'distilbert-base-uncased'
model = ktrain.text.Transformer(PRETRAINED_MODEL, maxlen=512, class_names=data_train.target_names)   # BERT-based models usually accept at most 512 tokens per sequence


In [11]:
train_set = model.preprocess_train(X_train, y_train, verbose=False)
test_set = model.preprocess_test(X_test, y_test, verbose=False)


In [12]:
classifier = model.get_classifier()
type(classifier)


transformers.models.distilbert.modeling_tf_distilbert.TFDistilBertForSequenceClassification

In [13]:
learner = ktrain.get_learner(classifier, train_data=train_set, val_data=test_set, batch_size=5)
# if you hit an out-of-memory error, try reducing 'batch_size' or 'maxlen'
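If 'maxlen' has to be reduced, one rough way to pick a value is to look at how long the documents are. The sketch below is an assumption-laden heuristic, not part of ktrain: `suggest_maxlen` is a hypothetical helper, and it counts whitespace-separated words, which undercounts the subword tokens a BERT-style tokenizer actually produces.

```python
import math


def suggest_maxlen(texts, coverage=0.95, cap=512):
    """Smallest length (in whitespace tokens) that covers `coverage` of the
    texts, capped at `cap`. Hypothetical helper, for illustration only."""
    if not texts:
        raise ValueError("texts must not be empty")
    lengths = sorted(len(text.split()) for text in texts)
    # Nearest-rank quantile: the rank-th shortest document.
    rank = max(1, math.ceil(coverage * len(lengths)))
    return min(cap, lengths[rank - 1])


# Toy documents with 1, 2, 3 and 5 words.
sample = ["a", "a b", "a b c", "a b c d e"]
print(suggest_maxlen(sample, coverage=0.5))
```

In the notebook this could be called as `suggest_maxlen(X_train)`; treat the result as a starting point, since the real token count per document is higher.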

In [14]:
# OPTIONAL - Find learning rate
# learner.lr_find(show_plot=True, max_epochs=2)

# BERT-based models - learning rates between 2e-5 and 5e-5 generally work well
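To make the role of the maximum learning rate concrete, here is a generic triangular one-cycle schedule in plain Python. It is a sketch under stated assumptions: the rate starts at `max_lr / 10`, rises linearly to `max_lr` at the midpoint, and falls back again. The function name and the `div_factor` parameter are illustrative, and ktrain's exact schedule may differ.

```python
import math


def one_cycle_lr(step, total_steps, max_lr, div_factor=10.0):
    """Learning rate at `step` for a simple triangular one-cycle schedule."""
    if total_steps < 2:
        raise ValueError("total_steps must be at least 2")
    base_lr = max_lr / div_factor
    half = (total_steps - 1) / 2
    if step <= half:
        frac = step / half                      # warm-up phase
    else:
        frac = (total_steps - 1 - step) / half  # cool-down phase
    return base_lr + (max_lr - base_lr) * frac


# Sizes taken from this notebook: 1185 training documents, batch_size=5, 4 epochs.
steps_per_epoch = math.ceil(1185 / 5)
total_steps = steps_per_epoch * 4
schedule = [one_cycle_lr(s, total_steps, 5e-5) for s in range(total_steps)]
print(total_steps, min(schedule), max(schedule))
```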

In [15]:
learner.fit_onecycle(lr=5e-5, epochs=4)



begin training using onecycle policy with max lr of 5e-05...
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x7faa000f9a90>

# Evaluation

In [18]:
# View the validation examples with the highest loss
learner.view_top_losses(n=3, preproc=model)

----------
id:700 | loss:7.32 | true:sci.med | pred:sci.electronics)

----------
id:517 | loss:7.3 | true:sci.med | pred:sci.electronics)

----------
id:627 | loss:7.2 | true:sci.med | pred:sci.electronics)
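To get a feel for what these loss values mean, they can be converted into the probability the model gave to the true class. This sketch assumes the reported loss is the per-example cross-entropy with a natural logarithm, so that `loss = -ln(p_true)`; that assumption is not confirmed by the output above.

```python
import math


def prob_from_loss(loss):
    """Probability assigned to the true class, assuming loss = -ln(p_true)."""
    return math.exp(-loss)


# Losses copied from the output above, keyed by example id.
top_losses = {700: 7.32, 517: 7.3, 627: 7.2}

for doc_id, loss in top_losses.items():
    print(f"id {doc_id}: p(true class) is about {prob_from_loss(loss):.5f}")
```

Under that assumption, each of these examples received well under 0.1% probability for its true class, so the model is confidently wrong on them rather than merely unsure.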



In [19]:
print(X_test[700])
# EEG and ECG are biomedical terms, but the training data probably gives the learner too little context to recognise them.
# This post is a grey area: it is about electronic equipment for biomedical use.

From: kcarver@dante.nmsu.edu (Kenneth Carver)
Subject: Isolation amplifiers for EEG/ECG *cheap*
Organization: New Mexico State University, Las Cruces, NM
Lines: 9
Distribution: usa
NNTP-Posting-Host: dante.nmsu.edu

I have several isolation amplifier boards that are the ideal interface
for EEG and ECG.  Isolation is essential for safety when connecting
line-powered equipment to electrodes on the body.  These boards
incorporate the Burr-Brown 3656 isolation module that currently sells
for $133, plus other op amps to produce an overall voltage gain of
350-400.  They are like new and guaranteed good.  $20 postpaid,
schematic included.  Please email me for more data.

--Ken Carver



# Prediction

In [20]:
predictor = ktrain.get_predictor(learner.model, preproc=model)

In [21]:
predictor.predict('Corona pandamic should be calmed when vaccine is widely applied')

'sci.med'

In [34]:
sentence = 'The presence of TPU and GPU has helped data science realm to speed up training process'
print(predictor.predict(sentence))
print(predictor.get_classes())
print(predictor.predict_proba(sentence))

sci.electronics
['sci.electronics', 'sci.med']
[0.99761    0.00238998]
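The probabilities are returned in the same order as the class list, so the two outputs can be paired up. A minimal sketch using the values printed above (the variable names are illustrative):

```python
# Values copied from the output above.
classes = ['sci.electronics', 'sci.med']
probs = [0.99761, 0.00238998]

# Pair each class name with its probability.
by_class = dict(zip(classes, probs))

# The predicted class is the one with the highest probability.
best = max(by_class, key=by_class.get)
print(best, by_class[best])
```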


In [26]:
predictor.explain('The presence of TPU and GPU has helped data science realm to speed up training process')
# words highlighted in green contribute most towards the predicted class

Contribution?,Feature
6.076,Highlighted in text (sum)
0.188,<BIAS>


# Save Model & Load Model

In [28]:
PATH_TO_PREDICTOR = './predictor/ktrain_2newsgroup'

In [27]:
predictor.save(PATH_TO_PREDICTOR)

In [29]:
model_loaded = ktrain.load_predictor(PATH_TO_PREDICTOR)

In [31]:
model_loaded.predict('Corona pandamic should be calmed when vaccine is widely applied')

'sci.med'