This notebook is designed to be run either locally or in Google Colab. For usage in Google Colab, the following files/folders should be uploaded to the default directory:
- utils.py
- /data/
    
Resources:
- https://github.com/huggingface/transformers/blob/master/examples/language-modeling/run_clm.py
- https://github.com/huggingface/transformers/blob/master/examples/text-generation/run_generation.py
- https://huggingface.co/transformers/main_classes/trainer.html
- https://huggingface.co/transformers/training.html

# Install Prerequisites

In [1]:
!mkdir -p data/bbc/politics
!mkdir src/
!mkdir models/

mkdir: data/bbc: No such file or directory
mkdir: src/: File exists


In [None]:
import os
import sys
if 'google.colab' in str(get_ipython()):
    !pip install datasets
    !pip install transformers
else:
    sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(''))))

In [1]:
from importlib import reload
import src.utils
reload(src.utils)

<module 'src.utils' from '/Users/jaipancholi/Code/in-his-shoes/src/utils.py'>

# Load Model

In [2]:
import tensorflow as tf
from src.utils import TransformerLoader

In [3]:
model = 'gpt2'
tokenizer, model = TransformerLoader.from_huggingface(model, framework='tf')

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [4]:
type(model)

transformers.modeling_tf_gpt2.TFGPT2LMHeadModel

# Load and Prepare Data

In [5]:
from src.utils import DataReader
import math

In [6]:
text = DataReader.read_bbc_politics()
text = ' '.join(sentence for sentence in text) # join into large string
tokenized_text = tokenizer.encode(text, return_tensors='tf')
tokenized_text = tokenized_text[0]

/Users/jaipancholi/Code/in-his-shoes/src


In [7]:
# default tokeniser stems
for i in range(10):
    value = tokenized_text[i].numpy()
    print(text.split(' ')[i], value, tokenizer.decode([value]))

Baron 21770  Baron
Kinnock 33304  Kinn
makes 735 ock
Lords 1838  makes
debut 18651  Lords
 8886  debut
Former 220  
Labour 14466  Former
leader 7179  Labour
Neil 3554  leader


In [8]:
# split into chunks
seq_length = 10

features = []
labels = []

# # here labels == features
# for i in range(0, len(tokenized_text) - seq_length, seq_length):
#     features.append(tokenized_text[i:i+seq_length])
#     labels.append(tokenized_text[i:i+seq_length])

# here labels = featurs + 1 (shifted by 1, next token prediction)
examples = []
for i in range(0, len(tokenized_text) - seq_length + 1, seq_length):
    examples.append(tokenized_text[i:i + seq_length])

for ex in examples:
    features.append(ex[:-1])
    labels.append(ex[1:])

In [9]:
BATCH_SIZE = 12
# BUFFER_SIZE = 1000
BUFFER_SIZE = len(features)

dataset = tf.data.Dataset.from_tensor_slices((features, labels)).shuffle(BUFFER_SIZE)
# dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

train_size = math.ceil(len(features) * 0.8)  # 80, 20 split

train_dataset = dataset.take(train_size).batch(BATCH_SIZE, drop_remainder=True)
val_dataset = dataset.skip(train_size).batch(BATCH_SIZE, drop_remainder=True)

# Train

In [12]:
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from src.utils import plot_tensorflow_training_history

In [13]:
# defining our optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)

# definining our loss function
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# defining our metric which we want to observe
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

# compiling the model
model.compile(optimizer=optimizer, loss=[loss, *[None] * model.config.n_layer], metrics=[metric])

In [None]:
num_epoch = 10

model_filepath = './models/bbc-politics'

metric = 'val_logits_accuracy'

callbacks = [
    EarlyStopping(monitor=metric, patience=25),
    ModelCheckpoint(f'{model_filepath}', save_best_only=True, save_weights_only=False, monitor=metric)
]

history = model.fit(
    train_dataset,
    epochs=num_epoch, 
    validation_data=(val_dataset), 
#     verbose=1,
    callbacks=callbacks
)

Epoch 1/10
  33/1638 [..............................] - ETA: 27:26 - loss: 4.9873 - output_1_loss: 4.9873 - output_1_accuracy: 0.2199 - output_2_1_accuracy: 7.5181e-04 - output_2_2_accuracy: 0.0015 - output_2_3_accuracy: 7.1256e-04 - output_2_4_accuracy: 0.0021 - output_2_5_accuracy: 0.0016 - output_2_6_accuracy: 0.0013 - output_2_7_accuracy: 7.1049e-04 - output_2_8_accuracy: 6.7069e-04 - output_2_9_accuracy: 4.6400e-04 - output_2_10_accuracy: 0.0012 - output_2_11_accuracy: 5.8838e-04 - output_2_12_accuracy: 0.0011

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3418, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-14-94adf2115d15>", line 17, in <module>
    callbacks=callbacks
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1100, in fit
    tmp_logs = self.train_function(iterator)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 855, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2943, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/usr/

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3418, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-14-94adf2115d15>", line 17, in <module>
    callbacks=callbacks
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 1100, in fit
    tmp_logs = self.train_function(iterator)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 855, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/usr/local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2943, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/usr/

In [None]:
graph_configs = {
  'training_loss': 'loss',
  'training_acc': 'logits_accuracy',
  'val_loss': 'val_loss',
  'val_acc': 'val_logits_accuracy',
}
plot_tensorflow_training_history(history, **graph_configs, save_filename='politics.html')

# Generate Output

In [10]:
from src.utils import generate_sequence

In [13]:
text = 'the new year in the meantime we will be studying'
generate_sequence(text, tokenizer, model)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


' the new year in the meantime we will be studying the progress of the economy in various ways.\n\nWe have taken decisions on the issues of capital controls and the budget. We have made decisions to improve working conditions. The government has now taken action'

In [14]:
text = 'the new year in the meantime we will be studying'
generate_sequence(text, tokenizer, model)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


" the new year in the meantime we will be studying them.\n\nWe'll make sure that we have a great holiday season and we are going to be doing our best to make it that way. We want to have fun at all times (and"

In [12]:
text = 'executive faces more than 1 000 similar claims for damages'
generate_sequence(text, tokenizer, model)

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


" executive faces more than 1 000 similar claims for damages.\n\nThe ruling marks the end of the government's attempt to build a consensus on legal and policy issues that had long been left unresolved. It also marks a major step forward in the fight against"