<a href="https://colab.research.google.com/github/MosheWasserb/BestPracticeTextClassificationDistillation/blob/main/GPTtoTinyModels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**How to use GPT to improve tiny models' performance?**

Data augmentation based on GPT (GPT-DA) has not proven to be a silver bullet for improving downstream tasks in low-resource scenarios. The main reason is that GPT, even when fine-tuned on task-specific training data, still generates data that is too similar to the training data and does not expose the full distribution of the domain. A less well-known fact, on the other hand, is that GPT-generated data is very effective for training tiny models (a few million parameters). In this notebook, I'll demonstrate how to use data augmentation based on GPT-Medium to improve the accuracy of tiny models.

The following table summarizes the results of fine-tuning several models on the SST-2 dataset (part of the GLUE benchmark). All of the models were fine-tuned on the 67K samples available as SST-2 training data in GLUE; in addition, the tiny models (FNet, Transformer, Bi-LSTM) were also trained with augmented data generated by GPT. As the table shows, the tiny models gain significant accuracy when trained with GPT-DA. In this notebook I explain how these results can be replicated.

| **Model**               | **#Params [M]** | **Accuracy [%]** | **Training data**     |
|:-----------------------:|:---------------:|:----------------:|:---------------------:|
| ***tiny-BERT kerasNLP** | 4               | 83.5             | 67K GLUE              |
| ****tiny-BERT distill** | 4               | 83.4             | 67K GLUE              |
| **DistilBERT**          | 67              | 92.3             | 67K GLUE              |
| **FNet**                | 0.74            | 81.5 / 88.7      | 67K / GPT-DA (800K)   |
| **Transformer**         | 0.79            | 81.2 / 87.5      | 67K / GPT-DA (800K)   |
| **Bi-LSTM**             | 0.66            | 82.9 / **91.5**  | 67K / GPT-DA (800K)   |




*tiny-BERT kerasNLP - See https://keras.io/api/keras_nlp/models/ 

**tiny-BERT distill - See https://www.philschmid.de/knowledge-distillation-bert-transformers

FNet and Transformer network results are taken from https://keras.io/examples/nlp/fnet_classification_with_keras_nlp/
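
The size of the boost is easy to tabulate. The short sketch below copies the accuracy pairs for the three tiny models from the table above and computes the absolute gain from adding GPT-DA; the names `results` and `gains` are illustrative only.

```python
# Accuracy pairs copied from the results table:
# (trained on 67K GLUE samples, trained with GPT-DA).
results = {
    "FNet": (81.5, 88.7),
    "Transformer": (81.2, 87.5),
    "Bi-LSTM": (82.9, 91.5),
}

# Absolute accuracy gain (in points) from training with the augmented data.
gains = {name: round(aug - base, 1) for name, (base, aug) in results.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

The gains come out at 7.2, 6.3 and 8.6 points respectively, so the Bi-LSTM benefits the most.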

Recommended papers:

1. "A Few More Examples May Be Worth Billions of Parameters" https://arxiv.org/pdf/2110.04374.pdf

2. "Data Augmentation using Pre-trained Transformer Models" https://arxiv.org/abs/2003.02245

3. "GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation" https://arxiv.org/abs/2104.08826

You can download the Bi-LSTM model fine-tuned on GPT-DA from the Hugging Face repository (**https://huggingface.co/moshew/distilbilstm-finetuned-sst-2-english**)

In [None]:
!pip install datasets
!pip install -q --upgrade keras-nlp tensorflow

**Data Augmentation**

Please note that steps 1-2 are only required if you don't have sufficient in-domain unlabeled data.

1. To fine-tune a GPT model on the SST-2 dataset, follow J. Mamou's example at https://github.com/jmamou/data-augment


A GPT-Medium model that was fine-tuned on SST-2 is available on the Hugging Face hub (**https://huggingface.co/jmamou/gpt2-medium-SST-2**)

2. To generate samples, see J. Mamou's example at https://github.com/jmamou/data-augment.
Please note that this can take several hours even on strong machines, since a large amount of data (~800K samples) has to be generated.

3. Once a large set of unlabeled data has been generated, the next step is to label each data sample with the corresponding RoBERTa prediction.

In [None]:
# You can use the following lines of code from S-BERT (sentence-transformers) to generate the RoBERTa predictions
#!pip install -U sentence-transformers
#from sentence_transformers import CrossEncoder
#model = CrossEncoder('philschmid/roberta-large-sst2', num_labels=2)
#y_aug = model.predict(list(zip(x_aug)), batch_size=32)

Pre-made GPT-augmented SST2 data and the corresponding predictions are available on the Hugging Face hub (**jmamou/augmented-glue-sst2**)
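
The tiny models in this notebook are trained with a KL-divergence loss, which expects the labels to be class probabilities rather than raw scores. As a minimal sketch, the snippet below turns teacher logits into such probabilities with a softmax. The `teacher_logits` values are made up, and whether the published `prediction` column was produced in exactly this way is an assumption.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical teacher outputs for three sentences; columns are
# (negative, positive), the same order as the one-hot labels in this notebook.
teacher_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-3.0, 4.0]])

soft_labels = softmax(teacher_logits)      # shape (3, 2), each row sums to 1
hard_labels = soft_labels.argmax(axis=1)   # 0 = negative, 1 = positive
print(soft_labels.round(3), hard_labels)
```

Keeping the full probability vector, instead of only the argmax, lets the student also learn how confident the teacher was on each sample.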

**Training and Evaluation**

Now let's train our tiny models.

In [2]:
import tensorflow as tf
import os
import numpy as np
from sklearn.metrics import accuracy_score
from tensorflow import keras

from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences

from datasets import load_dataset

keras.utils.set_random_seed(42)

In [17]:
BATCH_SIZE = 512
MAX_SEQUENCE_LENGTH = 64
VOCAB_SIZE = 10000
EMBED_DIM = 64
INTERMEDIATE_DIM = 128

NUM_CLASS = 2 

# Map an integer SST-2 label to a one-hot vector: 0 = negative, 1 = positive
value2hot = {
  0: [1, 0],
  1: [0, 1],
}

In [None]:
# Load the pre-made augmented data and RoBERTa's predictions
from datasets import load_dataset

# Train with the GPT-2 augmented data (the labels are RoBERTa's predictions)
sst2_aug = load_dataset("jmamou/augmented-glue-sst2")
y_train = np.array(sst2_aug['train']['prediction'])
x_train = sst2_aug['train']['sentence']

# To train on the original GLUE dataset instead, uncomment the following lines
#sst2_glue = load_dataset("glue", "sst2")
#y_train = np.array([value2hot[l] for l in sst2_glue['train']['label']])
#x_train = sst2_glue['train']['sentence']
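
Both label formats above (RoBERTa predictions and one-hot GLUE labels) can be fed to the `"KLD"` loss used later in this notebook. As a rough illustration, written in plain NumPy rather than taken from the Keras implementation and using invented numbers, the sketch below computes KL(target || prediction) per sample and shows that for a one-hot target it reduces to the usual cross-entropy, -log p(true class).

```python
import numpy as np

def kl_divergence(y_true, y_pred, eps=1e-7):
    """Per-sample KL(y_true || y_pred); inputs are clipped to avoid log(0)."""
    y_true = np.clip(y_true, eps, 1.0)
    y_pred = np.clip(y_pred, eps, 1.0)
    return np.sum(y_true * np.log(y_true / y_pred), axis=1)

student = np.array([[0.2, 0.8]])          # made-up student probabilities
soft_target = np.array([[0.1, 0.9]])      # teacher-style probability label
one_hot_target = np.array([[0.0, 1.0]])   # GLUE-style hard label

kl_soft = kl_divergence(soft_target, student)[0]
kl_hard = kl_divergence(one_hot_target, student)[0]
cross_entropy = -np.log(0.8)
print(kl_soft, kl_hard, cross_entropy)
```

With a one-hot target the KL value matches the cross-entropy up to the clipping epsilon, which is why the same loss setting works for both training sets.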

In [None]:
#Load test/validation SST2 data
sst2 = load_dataset("SetFit/sst2")

y_val = np.array([value2hot[l] for l in sst2['validation']['label']])
y_test = np.array([value2hot[l] for l in sst2['test']['label']])

x_val=sst2['validation']['text']
x_test=sst2['test']['text']

Test Bert-tiny fine-tuned on SST2

In [None]:
from sklearn.metrics import accuracy_score
import keras_nlp

# Load the BERT-tiny preset already fine-tuned on SST-2 and score it on the test set
classifier = keras_nlp.models.BertClassifier.from_preset("bert_tiny_en_uncased_sst2")
y_pred = np.argmax(classifier.predict(x_test), axis=1)
accuracy_score(sst2['test']['label'], y_pred)  # argument order is (y_true, y_pred)
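
The tiny models trained later in this notebook output class probabilities, while `y_val` and `y_test` are stored as one-hot vectors, so scoring them means taking the argmax on both sides. A small sketch follows, with made-up arrays standing in for `y_test` and a model's `predict(X_test)` output; the helper name `accuracy_from_probs` is hypothetical.

```python
import numpy as np

def accuracy_from_probs(y_true_one_hot, y_prob):
    """Fraction of rows where the most probable class matches the label."""
    true_ids = np.argmax(y_true_one_hot, axis=1)
    pred_ids = np.argmax(y_prob, axis=1)
    return float(np.mean(true_ids == pred_ids))

# Placeholder data: four samples, two classes (negative, positive).
toy_labels = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
toy_probs = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.8, 0.2]])

toy_accuracy = accuracy_from_probs(toy_labels, toy_probs)
print(toy_accuracy)
```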

Simple word tokenizer

In [18]:
# Tokenize our training data
tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(x_train)
word_index = tokenizer.word_index
num_words = min(VOCAB_SIZE, len(word_index) + 1)

In [19]:
def tokenizer_padding(sentences):
    # Convert texts to integer id sequences, then pad/truncate at the end to a fixed length
    sequences = tokenizer.texts_to_sequences(sentences)
    return pad_sequences(sequences, padding='post', truncating='post', maxlen=MAX_SEQUENCE_LENGTH)

X_train = tokenizer_padding(x_train)
X_val = tokenizer_padding(x_val)
X_test = tokenizer_padding(x_test)
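
To make the preprocessing concrete, here is a simplified pure-Python stand-in for what the two cells above do: words are ranked by frequency, words whose rank is at or beyond the vocabulary size are dropped, and every sequence is cut or zero-padded at the end to a fixed length. The helper names and toy sentences are illustrative only, and the real Keras `Tokenizer` also filters punctuation, which this sketch skips.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a 1-based rank, most frequent first (0 is padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(text, word_index, vocab_size, max_len):
    """Drop out-of-vocabulary words, then truncate/pad at the end ('post')."""
    ids = [word_index[w] for w in text.lower().split()
           if w in word_index and word_index[w] < vocab_size]
    ids = ids[:max_len]                        # truncating='post'
    return ids + [0] * (max_len - len(ids))    # padding='post'

toy_texts = ["a great movie", "a dull movie", "a great great cast"]
toy_index = build_word_index(toy_texts)
encoded = encode("a great new movie", toy_index, vocab_size=4, max_len=5)
print(toy_index, encoded)
```

Note that an unseen word such as "new" simply disappears from the sequence, so rare or novel words carry no signal for the tiny models.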

Test FNet

In [None]:
input_ids = keras.Input(shape=(None,), dtype="int64", name="input_ids")

x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
    mask_zero=True,
)(input_ids)

x = keras_nlp.layers.FNetEncoder(intermediate_dim=INTERMEDIATE_DIM)(inputs=x)
x = keras_nlp.layers.FNetEncoder(intermediate_dim=INTERMEDIATE_DIM)(inputs=x)
x = keras_nlp.layers.FNetEncoder(intermediate_dim=INTERMEDIATE_DIM)(inputs=x)


x = keras.layers.GlobalAveragePooling1D()(x)
x = keras.layers.Dropout(0.1)(x)
outputs = keras.layers.Dense(NUM_CLASS, activation="softmax")(x)

fnet_classifier = keras.Model(input_ids, outputs, name="fnet_classifier")

fnet_classifier.summary()
fnet_classifier.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="KLD",
    metrics=["accuracy"],
)

# The test split is passed as validation_data, so the logged val_accuracy is test accuracy
fnet_classifier.fit(X_train, y_train, epochs=3, validation_data=(X_test, y_test), batch_size=BATCH_SIZE, shuffle=True)

Model: "fnet_classifier"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_ids (InputLayer)      [(None, None)]            0         
                                                                 
 token_and_position_embeddin  (None, None, 64)         644096    
 g (TokenAndPositionEmbeddin                                     
 g)                                                              
                                                                 
 f_net_encoder (FNetEncoder)  (None, None, 64)         16832     
                                                                 
 f_net_encoder_1 (FNetEncode  (None, None, 64)         16832     
 r)                                                              
                                                                 
 f_net_encoder_2 (FNetEncode  (None, None, 64)         16832     
 r)                                                

<keras.callbacks.History at 0x7fe7c640af10>

Transformer

In [None]:
NUM_HEADS = 2
input_ids = keras.Input(shape=(None,), dtype="int64", name="input_ids")


x = keras_nlp.layers.TokenAndPositionEmbedding(
    vocabulary_size=VOCAB_SIZE,
    sequence_length=MAX_SEQUENCE_LENGTH,
    embedding_dim=EMBED_DIM,
    mask_zero=False,
)(input_ids)

x = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)
x = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)
x = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=INTERMEDIATE_DIM, num_heads=NUM_HEADS
)(inputs=x)


x = keras.layers.GlobalAveragePooling1D()(x)
x = keras.layers.Dropout(0.1)(x)
outputs = keras.layers.Dense(NUM_CLASS, activation="softmax")(x)

transformer_classifier = keras.Model(input_ids, outputs,name="transformer_classifier")


transformer_classifier.summary()
transformer_classifier.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="KLD",
    metrics=["accuracy"],
)
# The test split is passed as validation_data, so the logged val_accuracy is test accuracy
transformer_classifier.fit(X_train, y_train, epochs=3, validation_data=(X_test, y_test), batch_size=BATCH_SIZE, shuffle=True)

Model: "transformer_classifier"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_ids (InputLayer)      [(None, None)]            0         
                                                                 
 token_and_position_embeddin  (None, None, 64)         644096    
 g_1 (TokenAndPositionEmbedd                                     
 ing)                                                            
                                                                 
 transformer_encoder (Transf  (None, None, 64)         33472     
 ormerEncoder)                                                   
                                                                 
 transformer_encoder_1 (Tran  (None, None, 64)         33472     
 sformerEncoder)                                                 
                                                                 
 transformer_encoder_2 (Tran  (None, None, 6

<keras.callbacks.History at 0x7fe7219ff220>

Bi-LSTM + GloVe

In [None]:
# Download the GloVe archive and extract only the 50-d vectors
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip glove.6B.50d.txt

In [20]:
# Load GloVe
# dict mapping each word to its embedding vector
glove = {}
# read the word embeddings file (extracted from the ~820MB glove.6B.zip archive)
with open('glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        glove[word] = coefs
# check the length
len(glove) # 400000

# Build the embedding matrix: row i holds the GloVe vector of the word with index i
embedding_matrix = np.zeros((num_words, 50))
for word, i in word_index.items():
    if i >= VOCAB_SIZE:
        continue
    embedding_vector = glove.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    # words not found in GloVe keep an all-zeros row
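
A quick sanity check worth running at this point is how many vocabulary words actually received a pre-trained vector. The sketch below uses a tiny in-memory stand-in for the GloVe file (three-dimensional vectors and a made-up `toy_word_index`) to show the same parse-and-fill logic plus a coverage count; all `toy_` names are hypothetical and do not touch the notebook's variables.

```python
import io
import numpy as np

# Toy stand-in for glove.6B.50d.txt: "<word> <v1> <v2> ..." per line.
toy_glove_file = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")
toy_word_index = {"good": 1, "bad": 2, "zzyzx": 3}  # hypothetical vocabulary
toy_vocab_size, toy_dim = 4, 3

# Parse each line into word -> vector.
toy_glove = {}
for line in toy_glove_file:
    word, *coefs = line.split()
    toy_glove[word] = np.asarray(coefs, dtype="float32")

# Row i holds the vector of the word with index i; row 0 is the padding id.
toy_matrix = np.zeros((toy_vocab_size, toy_dim))
hits = 0
for word, i in toy_word_index.items():
    vector = toy_glove.get(word)
    if i < toy_vocab_size and vector is not None:
        toy_matrix[i] = vector
        hits += 1

coverage = hits / len(toy_word_index)
print(f"{hits} of {len(toy_word_index)} words covered ({coverage:.0%})")
```

Applied to the real `word_index` and `glove` dictionaries, the same count shows how much of the 10,000-word vocabulary starts from a pre-trained vector rather than from zeros.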

In [23]:
embedding_layer = keras.layers.Embedding(input_dim=VOCAB_SIZE,
                            output_dim=50,
                            input_length=MAX_SEQUENCE_LENGTH,
                            embeddings_initializer=keras.initializers.Constant(embedding_matrix), #Glove
                            trainable=True)

input_ids = keras.Input(shape=(MAX_SEQUENCE_LENGTH,), dtype="int64")
embedded_sequences = embedding_layer(input_ids)
x = keras.layers.Bidirectional(keras.layers.LSTM(64, dropout=0.2, return_sequences=True))(embedded_sequences)
x = keras.layers.Bidirectional(keras.layers.LSTM(64))(x)
x = keras.layers.Dropout(0.2)(x)
outputs = keras.layers.Dense(NUM_CLASS, activation='softmax')(x)

bilstm_classifier = keras.Model(input_ids, outputs)

bilstm_classifier.summary()
bilstm_classifier.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="KLD",
    metrics=["accuracy"],
)
# The test split is passed as validation_data, so the logged val_accuracy is test accuracy
bilstm_classifier.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test), batch_size=BATCH_SIZE, shuffle=True)

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_5 (InputLayer)        [(None, 64)]              0         
                                                                 
 embedding_4 (Embedding)     (None, 64, 50)            500000    
                                                                 
 bidirectional_8 (Bidirectio  (None, 64, 128)          58880     
 nal)                                                            
                                                                 
 bidirectional_9 (Bidirectio  (None, 128)              98816     
 nal)                                                            
                                                                 
 dropout_5 (Dropout)         (None, 128)               0         
                                                                 
 dense_4 (Dense)             (None, 2)                 258 

<keras.callbacks.History at 0x7fd8ef1a1e50>
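
The parameter counts in the summary above can be reproduced by hand, which is a handy check that the model really is in the sub-million range. The sketch assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias); the helper names are illustrative.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)

def bilstm_params(input_dim, units):
    """A bidirectional wrapper holds two independent LSTMs."""
    return 2 * lstm_params(input_dim, units)

embedding = 10000 * 50                   # VOCAB_SIZE x GloVe dimension
first_bilstm = bilstm_params(50, 64)     # reads the 50-d embeddings
second_bilstm = bilstm_params(128, 64)   # reads the 2 x 64 concatenated states
dense = 128 * 2 + 2                      # weights + biases for NUM_CLASS = 2

total = embedding + first_bilstm + second_bilstm + dense
print(first_bilstm, second_bilstm, total)
```

The total of 657,954 parameters is about 0.66M, in line with the Bi-LSTM row of the results table, and roughly three quarters of it sits in the embedding layer.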