# BERT for QQP

- Fine-tune BERT on Quora Question Pairs (QQP)

From my point of view, there are two ways to use BERT for QQP. The first uses the pretrained BERT as a fixed sentence encoder: the embedded vectors become the input features of a neural network that classifies the question pairs. Implementations of this idea already exist, such as `SBERT` (Sentence-BERT, the `sentence-transformers` package), and packages like `bert-as-service` make the pretrained model easy to use. The second fine-tunes the BERT model on the downstream task, which may perform better because the pretrained weights are also trained on the downstream training set. Here we implement the second idea: fine-tuning BERT on the QQP problem.
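
As a minimal sketch of the first (feature-based) approach, the two sentence embeddings of a pair can be combined into one feature vector before a classifier sees them. The concatenation `(u, v, |u - v|)` follows the classification setup described for SBERT. The function name `pair_features` and the tiny 3-dimensional vectors are assumptions for illustration only; real BERT sentence embeddings have hundreds of dimensions.

```python
def pair_features(u, v):
    """Combine two sentence embeddings into one feature vector: (u, v, |u - v|)."""
    if len(u) != len(v):
        raise ValueError("both embeddings must have the same dimension")
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    return list(u) + list(v) + abs_diff


# Hypothetical embeddings for question1 and question2 (made-up numbers).
u = [0.5, -1.0, 2.0]
v = [0.5, 1.0, 0.0]
features = pair_features(u, v)
print(len(features))  # three times the embedding dimension
```

A downstream classifier would then be trained on these feature vectors while BERT itself stays unchanged, which is the main difference from the fine-tuning approach used in this notebook.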

The model contains two parts:
- The first part is the **pretrained BERT model** (`bert-base-uncased`), which embeds the input ([input_ids, attention_masks, token_type_ids]).
- The second part is **a neural network structure** consisting of {bilstm-pooling-dense-sigmoid} with dropout.
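
To make the three BERT inputs concrete, here is a toy sketch of how one question pair becomes `input_ids`, `attention_mask` and `token_type_ids`. It is not the real tokenizer: it splits on whitespace instead of using WordPiece, the word ids (1001 and up) are made up, and `encode_pair` is a hypothetical helper. Only the special-token ids (0, 100, 101, 102) match the `bert-base-uncased` vocabulary.

```python
# Toy vocabulary: special-token ids as in bert-base-uncased, word ids invented.
VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "how": 1001, "do": 1002, "i": 1003, "learn": 1004, "python": 1005,
}


def encode_pair(tokens_a, tokens_b, vocab, max_length):
    """Build BERT-style inputs for one sentence pair (no truncation in this sketch)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    if len(tokens) > max_length:
        raise ValueError("pair is longer than max_length")
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    # Segment 0: [CLS] + first sentence + [SEP]; segment 1: second sentence + [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Real tokens get mask 1, padding gets mask 0.
    attention_mask = [1] * len(tokens)
    pad = max_length - len(tokens)
    input_ids += [vocab["[PAD]"]] * pad
    attention_mask += [0] * pad
    token_type_ids += [0] * pad
    return input_ids, attention_mask, token_type_ids


ids, mask, types = encode_pair(
    "how do i learn python".split(), "how to learn python".split(), VOCAB, max_length=14
)
print(ids)
print(mask)
print(types)
```

The word "to" is missing from the toy vocabulary, so it maps to `[UNK]`, and the last two positions are padding that the attention mask tells the model to ignore.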

In the model, these things are done step by step:
  - prepare the raw data (load, preprocess...)
  - create a data generator and generate training and testing data for the model
  - set `bert_model.trainable = False` and train the model (only the layers in the neural network structure are trained)
  - set `bert_model.trainable = True` and train the model again (fine-tuning BERT)
  - make predictions on the testing set
  - get submissions on qqp problem

Although this model does not yet beat our featured Siamese-LSTM model (because it is not mature), several improvements are planned:
  - add more features (feature_nlp.csv, feature_tm.csv, etc...) from `feature_engineering_train.ipynb` as a supplement
  - adjust the neural network structure (the layers that perform 'feature extraction')
  - tune the hyperparameters and the number of training epochs (we trained for just 1 epoch to get the scores because of time and resource limits)
  - ...


**NOTES:**
- This notebook runs in the conda env `tensorflow_env` (TensorFlow 2.3.1) on my MacBook 16.
- The BERT fine-tuning for the QQP problem follows the official Keras and Transformers documentation:
  - https://huggingface.co/transformers/model_doc/bert.html
  - https://keras.io/examples/nlp/semantic_similarity_with_bert/
  - https://keras.io/examples/nlp/masked_language_modeling/
  - https://www.cnblogs.com/dogecheng/p/11617940.html
  
  - SBERT: https://www.sbert.net/docs/training/overview.html?highlight=get_word_embedding_dimension
  - BERT as a service: https://github.com/hanxiao/bert-as-service#building-a-qa-semantic-search-engine-in-3-minutes

In [15]:
# Install transformers

!pip install transformers



In [16]:
import numpy as np
import pandas as pd
import tensorflow as tf
import transformers  # TensorFlow 2.3.1 includes the swish activation function


In [3]:
# Hyperparameters

max_length = 128  # Maximum length of input sentence to the model.
batch_size = 32
epochs = 2


In [19]:
from sklearn.model_selection import train_test_split

# Load the QQP training set (404,290 pairs) and hold out 10% of it for validation.
train_df = pd.read_csv("Data/train.csv")
train_df, valid_df = train_test_split(train_df, test_size=0.1, random_state=42)
test_df = pd.read_csv('Data/test.csv')

# Shape of the data
print(f"Total train samples : {train_df.shape}")
print(f"Total validation samples: {valid_df.shape}")
print(f"Total test samples: {test_df.shape}")

Total train samples : (363861, 6)
Total validation samples: (40429, 6)
Total test samples: (2345796, 3)


In [20]:
print(type(train_df))
print(type(test_df))
print(type(valid_df))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


In [22]:
# Replace NaN values with a blank string, then check that none remain
print("Number of missing values")
train_df = train_df.fillna(' ')
test_df = test_df.fillna(' ')
valid_df = valid_df.fillna(' ')

print(train_df.isnull().sum())


Number of missing values
id              0
qid1            0
qid2            0
question1       0
question2       0
is_duplicate    0
dtype: int64


In [25]:
train_df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
58017,58017,101876,82088,What are some mechanical engineering projects?,What are some mechanical projects?,1
75700,75700,129504,13615,How do you know if she is interested in you?,How do I know if she is interested?,1
333380,333380,7477,169540,"What is the best reply to ""thank you"" in forma...",How do I respond when someone gives me a compl...,0
253960,253960,368596,368597,"Is the name ""Jimmy"" an alternate name for ""Joh...",I weigh 61.5kg and want to compete at 60kg(box...,0
251332,251332,177375,200251,What is the most inspirational book you have e...,Have you ever read a book that truly inspired ...,1


In [27]:
test_df.head()

Unnamed: 0,test_id,question1,question2
0,0,How does the Surface Pro himself 4 compare wit...,Why did Microsoft choose core m3 and not core ...
1,1,Should I have a hair transplant at age 24? How...,How much cost does hair transplant require?
2,2,What but is the best way to send money from Ch...,What you send money to China?
3,3,Which food not emulsifiers?,What foods fibre?
4,4,"How ""aberystwyth"" start reading?",How their can I start reading?


In [29]:
# Split data and labels

train_df["label"] = train_df["is_duplicate"]
#y_train = tf.keras.utils.to_categorical(train_df.label, num_classes=2)
y_train = train_df.label.values

valid_df["label"] = valid_df["is_duplicate"]
y_val = valid_df.label.values


In [30]:
# Data generator: turns the raw train, validation and test data into batches of BERT inputs.
# Note: __len__ uses floor division, so samples beyond the last full batch are dropped.
# Choose a batch_size that divides the number of samples when every row must be covered (e.g. test data).

class BertSemanticDataGenerator(tf.keras.utils.Sequence):
    '''
    Generates batches of data.

    Parameters:
        sentence_pairs: Array of (question1, question2) input sentence pairs.
        labels: Array of labels.
        batch_size: Integer batch size.
        shuffle: boolean, whether to shuffle the data.
        include_targets: boolean, whether to include the labels.

    Returns:
        Tuples `([input_ids, attention_mask, token_type_ids], labels)`
        (or just `[input_ids, attention_mask, token_type_ids]`
         if `include_targets=False`)
    '''

    def __init__(
        self,
        sentence_pairs,
        labels,
        batch_size=batch_size,
        shuffle=True,
        include_targets=True,
    ):
        self.sentence_pairs = sentence_pairs
        self.labels = labels
        self.shuffle = shuffle
        self.batch_size = batch_size
        self.include_targets = include_targets
        # Load our BERT tokenizer to encode the text.
        # We will use the bert-base-uncased pretrained model.
        self.tokenizer = transformers.BertTokenizer.from_pretrained(
            "bert-base-uncased", do_lower_case=True
        )
        self.indexes = np.arange(len(self.sentence_pairs))
        self.on_epoch_end()

    def __len__(self):
        # The number of batches per epoch.
        return len(self.sentence_pairs) // self.batch_size

    def __getitem__(self, idx):
        # Retrieves the batch of index.
        indexes = self.indexes[idx * self.batch_size : (idx + 1) * self.batch_size]
        sentence_pairs = self.sentence_pairs[indexes]

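        # With the BERT tokenizer's batch_encode_plus, both sentences of each pair
        # are encoded together and separated by a [SEP] token.
        # max_length is set without truncation=True, so the tokenizer warns and
        # falls back to the 'longest_first' truncation strategy.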
        encoded = self.tokenizer.batch_encode_plus(
            sentence_pairs.tolist(),
            add_special_tokens=True,
            max_length=max_length,
            return_attention_mask=True,
            return_token_type_ids=True,
            pad_to_max_length=True,
            return_tensors="tf",
        )

        # Convert batch of encoded features to numpy array.
        input_ids = np.array(encoded["input_ids"], dtype="int32")
        attention_masks = np.array(encoded["attention_mask"], dtype="int32")
        token_type_ids = np.array(encoded["token_type_ids"], dtype="int32")

        # Set to true if data generator is used for training/validation.
        if self.include_targets:
            labels = np.array(self.labels[indexes], dtype="int32")
            return [input_ids, attention_masks, token_type_ids], labels
        else:
            return [input_ids, attention_masks, token_type_ids]

    def on_epoch_end(self):
        # Shuffle indexes after each epoch if shuffle is set to True.
        if self.shuffle:
            np.random.RandomState(42).shuffle(self.indexes)
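

Because `__len__` returns `len(sentence_pairs) // batch_size`, any samples that do not fill a complete batch are never served. The sketch below works through that arithmetic for the dataset sizes printed earlier in this notebook. The helper names `batches_and_dropped` and `exact_batch_sizes` are assumptions made for illustration and are not part of the notebook.

```python
def batches_and_dropped(n_samples, batch_size):
    """Return (number of full batches, number of samples left out)."""
    n_batches = n_samples // batch_size
    dropped = n_samples - n_batches * batch_size
    return n_batches, dropped


def exact_batch_sizes(n_samples, candidates):
    """Return the candidate batch sizes that divide n_samples with no remainder."""
    return [b for b in candidates if n_samples % b == 0]


print(batches_and_dropped(363861, 32))   # training set:   (11370, 21)
print(batches_and_dropped(40429, 32))    # validation set: (1263, 13)
print(batches_and_dropped(2345796, 36))  # test set:       (65161, 0)
print(exact_batch_sizes(2345796, range(16, 65)))
```

Losing a few training or validation samples per epoch is harmless, but for the test set every row needs a prediction, which is why a batch size that divides 2345796 exactly (such as 36) matters there.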


In [31]:
!pip install ipywidgets
!jupyter nbextension enable --py widgetsnbextension



Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK


In [32]:
# Create the model under a distribution strategy scope.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Encoded token ids from BERT tokenizer.
    input_ids = tf.keras.layers.Input(
        shape=(max_length,), dtype=tf.int32, name="input_ids"
    )
    # Attention masks indicate to the model which tokens should be attended to.
    attention_masks = tf.keras.layers.Input(
        shape=(max_length,), dtype=tf.int32, name="attention_masks"
    )
    # Token type ids are binary masks identifying different sequences in the model.
    token_type_ids = tf.keras.layers.Input(
        shape=(max_length,), dtype=tf.int32, name="token_type_ids"
    )
    # Loading pretrained BERT model.
    bert_model = transformers.TFBertModel.from_pretrained("bert-base-uncased")
    # Freeze the BERT model to reuse the pretrained features without modifying them.
    bert_model.trainable = False

    sequence_output, pooled_output = bert_model(
        input_ids, attention_mask=attention_masks, token_type_ids=token_type_ids
    )
    # Add trainable layers on top of frozen layers to adapt the pretrained features on the new data.
    bi_lstm = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)
    )(sequence_output)
    # Applying hybrid pooling approach to bi_lstm sequence output.
    avg_pool = tf.keras.layers.GlobalAveragePooling1D()(bi_lstm)
    max_pool = tf.keras.layers.GlobalMaxPooling1D()(bi_lstm)
    concat = tf.keras.layers.concatenate([avg_pool, max_pool])
    dropout = tf.keras.layers.Dropout(0.3)(concat)
    output = tf.keras.layers.Dense(1, activation="sigmoid")(dropout)
    model = tf.keras.models.Model(
        inputs=[input_ids, attention_masks, token_type_ids], outputs=output
    )

    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss="binary_crossentropy",
        metrics=["acc"],
    )


print(f"Strategy: {strategy}")
model.summary()

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)


Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


Strategy: <tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7fbfae1712e0>
Model: "functional_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 128)]        0                                            
__________________________________________________________________________________________________
attention_masks (InputLayer)    [(None, 128)]        0                                            
__________________________________________________________________________________________________
token_type_ids (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
tf_bert_model_1 (TFBertModel)   ((None, 128, 768), ( 109482240   input_ids[0][0]     
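

As a rough cross-check of the model size, the parameter counts of the layers on top of BERT can be worked out by hand. The sketch below assumes the standard Keras formulas for an LSTM with bias (four gates, each with an input kernel, a recurrent kernel and a bias) and for a dense layer. The 109,482,240 figure for BERT comes from the summary above; the other numbers are computed here and are not taken from the original output. `lstm_params` and `dense_params` are hypothetical helper names.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM direction: 4 gates x (input kernel + recurrent kernel + bias)."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    """Parameters of a dense layer: weight matrix plus bias."""
    return input_dim * units + units


bert_params = 109482240            # from the model summary above
hidden_size = 768                  # width of BERT's sequence output
lstm_units = 64

bilstm = 2 * lstm_params(hidden_size, lstm_units)  # forward + backward LSTM
pooled_dim = 2 * (2 * lstm_units)                  # avg pool + max pool over 128 BiLSTM features
head = bilstm + dense_params(pooled_dim, 1)        # pooling and dropout add no parameters

print(bilstm)              # 426496
print(head)                # 426753
print(bert_params + head)  # 109908993
```

With BERT frozen, only the head (about 0.4M parameters) is updated, which is why the first training phase is much cheaper than the later fine-tuning phase.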

In [33]:
train_data = BertSemanticDataGenerator(
    train_df[["question1", "question2"]].values.astype("str"),
    y_train,
    batch_size=batch_size,
    shuffle=True,
)

valid_data = BertSemanticDataGenerator(
    valid_df[["question1", "question2"]].values.astype("str"),
    y_val,
    batch_size=batch_size,
    shuffle=False,
)

In [34]:
history = model.fit(
    train_data,
    validation_data=valid_data,
    epochs=epochs,
    use_multiprocessing=True,
    workers=-1,  
)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Epoch 1/2

Epoch 2/2


In [35]:
# Use a single epoch for the fine-tuning phase
epochs = 1

In [36]:
# Unfreeze the bert_model.
bert_model.trainable = True
# Recompile the model to make the change effective.
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-5),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
model.summary()

Model: "functional_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 128)]        0                                            
__________________________________________________________________________________________________
attention_masks (InputLayer)    [(None, 128)]        0                                            
__________________________________________________________________________________________________
token_type_ids (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
tf_bert_model_1 (TFBertModel)   ((None, 128, 768), ( 109482240   input_ids[0][0]                  
                                                                 attention_masks[0][0] 

In [37]:

history = model.fit(
    train_data,
    validation_data=valid_data,
    epochs=epochs,
    use_multiprocessing=True,
    workers=-1,
)





In [40]:
test_data = BertSemanticDataGenerator(
    test_df[["question1", "question2"]].values.astype("str"), 
    labels=None, 
    batch_size=36, 
    shuffle=False, 
    include_targets=False
)
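# The generator's batch_size must divide 2345796 (the number of test rows);
# otherwise the trailing partial batch is dropped and some rows get no prediction.
# The batch size comes from the generator (36), so batch_size is not passed to predict.
pred_probs = model.predict(
    test_data, 
    verbose=1, 
    workers=-1
)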

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.




In [62]:
print('Making the submission')
test_ids = test_df.test_id.values
# pred_probs holds the predicted duplicate probabilities from model.predict above
submission = pd.DataFrame({'test_id':test_ids, 'is_duplicate':pred_probs.ravel()})
submission.to_csv('Models/bert_1.csv', index=False)

Making the submission
