In [1]:
import pandas as pd
import numpy as np
import ast


from BERTpredictor import *
from BERTtuner import *
from DataPreper import *
from QueryMatcher import *

from bayes_opt import BayesianOptimization

Using state Vilnius server backend.


### Introduction

##### Objective

The goal of this project is to be able to predict relevant questions from natural language query/text.

##### Methods

There are various aproaches for semantic language similarity task. Common apporach is using embedding vectors of words, and them comparing high dimensional similarity between embedding vectors. Popular methods such as Word2Vec and Glove does a great job in representing words as vectors, however they are limited by the fact that words get a constant vector, irrespective of a context that words apear in. More modern approaches, specifically transformer-like methods such as Bert, GPT-2 or XLnet allows for flexible vectors depending on the context it apears on. Such models often come semi-ready for a lot of tasks, being pretrained on large corpuses. Nonetheless they are flexible to be fine-tuned for the task in hand by retraining/fine-tuning parameters using domain specific corpus. Approach taken in this project is to use BERT (base case uncased) fine tuned on domain corpus, which are conversational logs of clients seeking medical advice. 

Base-case uncased BERT model is trained on large wikipedia corpus and contains 110M parameters. There are other potentially promising pre-trained transformer models such as DeepPavlov/bert-base-cased-conversational, which are trained on conversational natural language texts, however testing those is beyond the scope of this project. Also there are significant number of ways text data could be prepared/engineered and vector similarity metrics. Some of them were explored here, however choices are abundant and user is free to expand, experiment further. It is important to note that embedding calculation uses sum from last 4 (out of 12) layers as suggested in research giving most accurate representation.

![](1_fWh1m6FyC6bAs3Qfh9iVmg.png)

##### Results

BERT model had been fine-tuned using combined NSP/MLP methodology (https://huggingface.co/bert-base-uncased), on a tiny dummy corpus. Results and usefulness of the model seem ambigous, however framework, if used with more extensive dataset should provide more promsing results.  

### BERT fine tunning (section 2)


This is the main section for BERT model fine-tunning. Few different data augmentation strategies experimented with basic/default BERT hyperparemeters to establish benchmark model. 

##### Initial data prep

One of the necessary steps in data preparation for tunning is bert is to create subsequent/random sentence pairs (NSP head) in the from (sentence, subsequent sentence). Balanced data had been created with one random pair for every original pair. There are ways to improve by creating random pairs chosing sentences from different logs, however in the absense of training data random pairs had been created from the same log.

##### No data augmentation

In order to do intial tests no data augmentation was implemented. Default parameters was used. Negative combined (NLP + MLP) log likelihood loss was calculated on training data for each epoch which droped to 4.85 after 5 epochs.

##### Synonym insertion

In order to create first augmented dataset a synonym insertation was used based on NLTK (wordnet) package. (https://www.holisticseo.digital/python-seo/nltk/wordnet)  Stopwords had been removed, before changing 0.3 words with synonyms no less than 1 word and no more than 10. One augmented sentence created for every original sentence. Training loss dropped to 4.7 after 5 epochs with original parameters.

##### Back translation

Second augmented dataset was created by translating sentences to foreign language (german was used in this study), and then back. Module used - https://pypi.org/project/translators/. One augmented sentence created for every original sentence. Training loss dropped to 4.85 after 5 epochs with original parameters.

##### Back translation + synonym insertation

Two augmentation techniques combined to create an extended dataset (3x the original). Training loss dropped to 4.7 after 5 epochs with original parameters.

### BERT optimization via Bayes opt (section 3)

In order to find optimal hyperparameters Bayesian Optimization was chosen due to very expensive evaluation. This is the method to squeeze the last juice out of ML model, however should come after extensive experimentation with feature engineering and data gathering. Bayesian optimization based on - https://github.com/fmfn/BayesianOptimization. 

##### Prepare loss function

In order to evaluate model a common function from information retrieval theory had been chosen which is **top k precision**. Meaning the proportion of relevant documents (questions) in top k retrieved/predicted documents (questions). 

##### Prepare objective function

Objective function to be maximized (returning **top k precision**) had been created which takes optimizeable parameters **epochs, batch size, learning rate**. It is possible to expand objective function by adding customizeable data preparation techniques, however for the purpose of this project objective function maintained smaller, and best feature engineering type was chosen in previous section based on results on training data.

##### Optimize

Bayesian optmization engine initiaed with 3 random initializations of parameters (within provided range) and 15 expectation maximization iterations. **Best parameters : batch_size=14 (set 16 for final model), epochs=9, learning_rate=0.0063**.

### Train and save best model (section 4)

Use parameters inferred from previous section to train the best model. Best model along with tokenizer saved to local folder, together with question embeddings for quick access for a command line app. 

### Conclusions and discussion

There were few challenges faced. First of all, different aggregations of BERT hidden states had been experimented with involving different number of layers and different functions, however similar tendency observed of overly high and non-differentiating similarity scores between queries and questions, decreasing with increasing query length. (as more sentences withing query observed) Result could be severe undertraining with words triggering similar neurons. Also, there could be some unknown bug in embedding aggregation which need to be investigated. One more reason could be wrong method chosen, it is likely that framing problem to have a classification head (multilabelled data for training aka tag problem) would improve embedding accuracy. It is likely that for the method to be used for production with low latency, bi-directional transformer should be trained from scratch.

### 1. Settings

In [2]:
# Settings

CORPUS_PATH=pathlib.Path().absolute().joinpath('corpus') # Path to text corpus (conversational logs in .txt files)
CORPUS_AUG_PATH=pathlib.Path().absolute().joinpath('corpus_aug') # Augmented text corpus - synthetically edited corpus
CORPUS_LABELLED=pathlib.Path().absolute().joinpath('corpus_labelled') # Labelled corpus

QNA_PATH=pathlib.Path().absolute().joinpath('questions') # Path to questions

MODEL_PATH=pathlib.Path().absolute().joinpath('models') # Path to trained models

### 2. BERT fine tunning

##### 2.1 Initial data prep

In [3]:
# Initialize data_prepper class

data_prepper=DataPrepper()

In [4]:
# Corpus text to lists (also split into sentences)

raw_text_list=data_prepper.txt_to_lists(CORPUS_PATH,to_sentence=True)
raw_text_list

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gedas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[['Hello, my name is Alice.',
  'I’m calling from Chicago and want to ask some questions.',
  'I’m pregnant for 6 months now, but I’m not telling anyone about this.',
  'I have periodic headaches.',
  'When I work, I feel like my ability to concentrate is being hindered by them.',
  'It’s already hard to work from 9 to 5 every day, God, and now this.',
  'My mom told me about this wonder drug called paracetamol.',
  'She assured me that it would help me a lot.',
  'I’m not sure if that is okay.',
  'It’s not like I’m a specialist in this field or anything so I decided to call here to be sure just in case.',
  'Can I use this medicine safely and will it help me?']]

In [5]:
# List of text sentences to list of subsequent sentence pairs

raw_sent_pairs=data_prepper.text_list_to_sent_pairs(raw_text_list)
raw_sent_pairs

[[('Hello, my name is Alice.',
   'I’m calling from Chicago and want to ask some questions.'),
  ('I’m calling from Chicago and want to ask some questions.',
   'I’m pregnant for 6 months now, but I’m not telling anyone about this.'),
  ('I’m pregnant for 6 months now, but I’m not telling anyone about this.',
   'I have periodic headaches.'),
  ('I have periodic headaches.',
   'When I work, I feel like my ability to concentrate is being hindered by them.'),
  ('When I work, I feel like my ability to concentrate is being hindered by them.',
   'It’s already hard to work from 9 to 5 every day, God, and now this.'),
  ('It’s already hard to work from 9 to 5 every day, God, and now this.',
   'My mom told me about this wonder drug called paracetamol.'),
  ('My mom told me about this wonder drug called paracetamol.',
   'She assured me that it would help me a lot.'),
  ('She assured me that it would help me a lot.',
   'I’m not sure if that is okay.'),
  ('I’m not sure if that is okay.',
 

In [6]:
# Create balanced data of random sentence pairs (uset all texts all sentences and one random instance pair per sentence)

raw_sent_notpairs=data_prepper.sent_pairs_to_random_pairs(raw_sent_pairs,text_resample_size=1.0,sent_resample_size=1.0,n_resamples=1)
raw_sent_notpairs

[[('She assured me that it would help me a lot.',
   'My mom told me about this wonder drug called paracetamol.'),
  ('Hello, my name is Alice.', 'I’m not sure if that is okay.'),
  ('It’s not like I’m a specialist in this field or anything so I decided to call here to be sure just in case.',
   'My mom told me about this wonder drug called paracetamol.'),
  ('I have periodic headaches.',
   'Can I use this medicine safely and will it help me?'),
  ('I’m calling from Chicago and want to ask some questions.',
   'I’m not sure if that is okay.'),
  ('It’s already hard to work from 9 to 5 every day, God, and now this.',
   'Can I use this medicine safely and will it help me?'),
  ('I’m pregnant for 6 months now, but I’m not telling anyone about this.',
   'Hello, my name is Alice.'),
  ('When I work, I feel like my ability to concentrate is being hindered by them.',
   'Can I use this medicine safely and will it help me?'),
  ('I’m not sure if that is okay.',
   'Can I use this medicine s

In [7]:
# Parse question files

questions_list=data_prepper.txt_to_lists(QNA_PATH)
questions_list

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gedas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[['Can paracetamol be used during pregnancy?',
  'What paracetamol is used for?',
  'What are paracetamol interactions with other medicine?',
  'Why is it raining today?',
  'When the war in Ukraine will end?',
  'What is the oldest town in London?']]

##### 2.2 No data augmentation

In [8]:
# Initialize BERT tuner class

BERT_tuner=BERTtuner()
BERT_tuner.load_from_web(model_name='bert-base-uncased')

  obj = cast(Storage, torch._UntypedStorage(nbytes))
Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


In [9]:
# Creat dataset for BERT with no data augmentation (add pairs and not pairs)

bert_inputs_noaug=BERT_tuner.prepare_data_for_BERT_train(raw_sent_pairs,raw_sent_notpairs)
bert_inputs_noaug

{'input_ids': tensor([[ 101, 7592, 1010,  ...,    0,    0,    0],
        [ 101, 1045, 1521,  ...,    0,    0,    0],
        [ 101, 1045,  103,  ...,    0,    0,    0],
        ...,
        [ 101, 2043,  103,  ...,    0,    0,    0],
        [ 101,  103, 1521,  ...,    0,    0,    0],
        [ 101, 2026, 3566,  ...,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'next_sentence_label': tensor([[0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [1],
        [1],
      

In [10]:
# Run bechmark BERT

m1,loss1=BERT_tuner.train_BERT(bert_inputs_noaug,epochs=5,batch_size=16,learning_rate=1e-4)

  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
Epoch 0: 100%|██████████████████████████████████████████████████████████████████| 2/2 [00:48<00:00, 24.45s/it, loss=14]
Epoch 1: 100%|████████████████████████████████████████████████████████████████| 2/2 [00:55<00:00, 27.57s/it, loss=10.1]
Epoch 2: 100%|████████████████████████████████████████████████████████████████| 2/2 [00:55<00:00, 27.82s/it, loss=7.07]
Epoch 3: 100%|████████████████████████████████████████████████████████████████| 2/2 [00:56<00:00, 28.44s/it, loss=6.06]
Epoch 4: 100%|████████████████████████████████████████████████████████████████| 2/2 [00:55<00:00, 27.53s/it, loss=4.85]


In [11]:
# Prepare question embeddings using newly trained model (query question)

query_matcher=QueryMatcher(m1,BERT_tuner.model_tokenizer,False,True) # Init query matcher class
#query_matcher.parse_questions_files(save_index=True) # Save question index (first run only)
#query_matcher.parse_queries_files(save_index=True) # Save queries index (first run only)
question_embeddings=query_matcher.calc_question_embeddings(questions_list,save=False) # Calcuate embeddings
question_embeddings

100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:03<00:00,  1.60it/s]


{'input_ids': tensor([[  101,  2064, 11498,  ...,     0,     0,     0],
        [  101,  2054,  2024,  ...,     0,     0,     0],
        [  101,  2054,  2003,  ...,     0,     0,     0],
        [  101,  2054, 11498,  ...,     0,     0,     0],
        [  101,  2043,  1996,  ...,     0,     0,     0],
        [  101,  2339,  2003,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'embeddings': tensor([[ 0.0815,  0.1221,  0.0295,  ..., -0.1476, -0.0154, -0.0385],
        [ 0.0842,  0.1325,  0.0305,  ..., -0.1609, -0.0156, -0.0422],
        [ 0.0618,  0.1013,  0.0190

In [12]:
# Run query matcher

res_1=query_matcher.match_queries(raw_text_list,prep_type='concat')
res_1

{'log_1.txt': {'0_sentences': [('questions_1.txt_2', tensor(0.9992)),
   ('questions_1.txt_5', tensor(0.9991)),
   ('questions_1.txt_4', tensor(0.9990)),
   ('questions_1.txt_3', tensor(0.9987)),
   ('questions_1.txt_1', tensor(0.9982)),
   ('questions_1.txt_0', tensor(0.9978))],
  '1_sentences': [('questions_1.txt_4', tensor(0.9982)),
   ('questions_1.txt_2', tensor(0.9981)),
   ('questions_1.txt_3', tensor(0.9979)),
   ('questions_1.txt_5', tensor(0.9978)),
   ('questions_1.txt_1', tensor(0.9977)),
   ('questions_1.txt_0', tensor(0.9973))],
  '2_sentences': [('questions_1.txt_4', tensor(0.9938)),
   ('questions_1.txt_2', tensor(0.9934)),
   ('questions_1.txt_5', tensor(0.9932)),
   ('questions_1.txt_3', tensor(0.9931)),
   ('questions_1.txt_1', tensor(0.9929)),
   ('questions_1.txt_0', tensor(0.9923))],
  '3_sentences': [('questions_1.txt_4', tensor(0.9923)),
   ('questions_1.txt_2', tensor(0.9918)),
   ('questions_1.txt_5', tensor(0.9916)),
   ('questions_1.txt_3', tensor(0.9915)),


##### 2.3 Synonym insertation (augmentation technique 1)

In [10]:
# Synonym insertation (pairs) based on wordnet (2 new sentences for each one)

raw_sent_pairs_synaug=data_prepper.aug_syn_swap(text_list=raw_sent_pairs,aug_p=0.3,aug_min=1, aug_max=10,n_new_sent=2,path=CORPUS_AUG_PATH,save=True)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gedas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [11]:
# Synonym insertation (not pairs) based on wordnet (2 new sentences for each one)

raw_sent_notpairs_synaug=data_prepper.aug_syn_swap(raw_sent_notpairs,aug_p=0.3,aug_min=1, aug_max=10,n_new_sent=2,path=CORPUS_AUG_PATH,save=True)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\gedas\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [28]:
# Creat dataset for BERT with no syn insertaion data augmentation

BERT_tuner=BERTtuner()
BERT_tuner.load_from_web(model_name='bert-base-uncased')

bert_inputs_synaug=BERT_tuner.prepare_data_for_BERT_train(raw_sent_pairs,raw_sent_notpairs)
bert_inputs_synaug

Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


{'input_ids': tensor([[ 101, 7592,  103,  ...,    0,    0,    0],
        [ 101, 1045, 1521,  ...,    0,    0,    0],
        [ 101, 1045, 1521,  ...,    0,    0,    0],
        ...,
        [ 101, 2043, 1045,  ...,    0,    0,    0],
        [ 101, 1045, 1521,  ...,    0,    0,    0],
        [ 101, 2016,  103,  ...,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'next_sentence_label': tensor([[0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [1],
        [1],
      

In [11]:
# Run BERT with augmented data via synonym insertation

m2,loss2=BERT_tuner.train_BERT(bert_inputs_synaug,epochs=5,batch_size=16,learning_rate=1e-4)

  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
Epoch 0: 100%|████████████████████████████████████████████████████████████████| 2/2 [00:52<00:00, 26.43s/it, loss=14.6]
Epoch 1: 100%|████████████████████████████████████████████████████████████████| 2/2 [00:54<00:00, 27.42s/it, loss=9.17]
Epoch 2: 100%|████████████████████████████████████████████████████████████████| 2/2 [00:52<00:00, 26.13s/it, loss=7.53]
Epoch 3: 100%|████████████████████████████████████████████████████████████████| 2/2 [00:56<00:00, 28.48s/it, loss=5.93]
Epoch 4: 100%|█████████████████████████████████████████████████████████████████| 2/2 [00:51<00:00, 25.56s/it, loss=4.7]


In [13]:
# Prepare question embeddings using newly trained model (query question)

query_matcher=QueryMatcher(m2,BERT_tuner.model_tokenizer,False,True) # Init query matcher class
question_embeddings=query_matcher.calc_question_embeddings(questions_list,save=False) # Calcuate embeddings
question_embeddings

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gedas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:03<00:00,  1.97it/s]


{'input_ids': tensor([[  101,  2064, 11498,  ...,     0,     0,     0],
        [  101,  2054,  2024,  ...,     0,     0,     0],
        [  101,  2054,  2003,  ...,     0,     0,     0],
        [  101,  2054, 11498,  ...,     0,     0,     0],
        [  101,  2043,  1996,  ...,     0,     0,     0],
        [  101,  2339,  2003,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'embeddings': tensor([[ 0.1025,  0.1519,  0.0126,  ..., -0.1515,  0.0138, -0.0596],
        [ 0.1049,  0.1653,  0.0110,  ..., -0.1642,  0.0158, -0.0644],
        [ 0.0765,  0.1268,  0.0038

In [14]:
# Run query matcher

res_1=query_matcher.match_queries(raw_text_list,prep_type='concat')
res_1

{'query_0': {'0_sentences': [('q_2', tensor(0.9988)),
   ('q_5', tensor(0.9987)),
   ('q_4', tensor(0.9987)),
   ('q_3', tensor(0.9983)),
   ('q_1', tensor(0.9977)),
   ('q_0', tensor(0.9973))],
  '1_sentences': [('q_4', tensor(0.9979)),
   ('q_3', tensor(0.9976)),
   ('q_2', tensor(0.9976)),
   ('q_5', tensor(0.9973)),
   ('q_1', tensor(0.9973)),
   ('q_0', tensor(0.9969))],
  '2_sentences': [('q_4', tensor(0.9933)),
   ('q_3', tensor(0.9926)),
   ('q_2', tensor(0.9925)),
   ('q_5', tensor(0.9923)),
   ('q_1', tensor(0.9921)),
   ('q_0', tensor(0.9918))],
  '3_sentences': [('q_4', tensor(0.9919)),
   ('q_3', tensor(0.9912)),
   ('q_2', tensor(0.9910)),
   ('q_5', tensor(0.9908)),
   ('q_1', tensor(0.9906)),
   ('q_0', tensor(0.9903))],
  '4_sentences': [('q_4', tensor(0.9866)),
   ('q_2', tensor(0.9857)),
   ('q_3', tensor(0.9856)),
   ('q_5', tensor(0.9854)),
   ('q_1', tensor(0.9848)),
   ('q_0', tensor(0.9844))],
  '5_sentences': [('q_4', tensor(0.9762)),
   ('q_2', tensor(0.9753))

##### 2.4 Back translation

In [12]:
# Synonym insertation (pairs) based on wordnet (2 new sentences for each one)

raw_sent_pairs_transaug=data_prepper.aug_trans_swap(raw_sent_pairs,from_lang='en',to_lang='de')

In [13]:
# Synonym insertation (not pairs) based on wordnet (2 new sentences for each one)

raw_sent_notpairs_transaug=data_prepper.aug_trans_swap(raw_sent_notpairs,from_lang='en',to_lang='de')

In [31]:
# Creat dataset for BERT with language back translation data augmentation

BERT_tuner=BERTtuner()
BERT_tuner.load_from_web(model_name='bert-base-uncased')

bert_inputs_transaug=BERT_tuner.prepare_data_for_BERT_train(raw_sent_pairs_transaug,raw_sent_notpairs_transaug)
bert_inputs_transaug

Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


{'input_ids': tensor([[ 101, 7592, 1010,  ...,    0,    0,    0],
        [ 101,  103, 2655,  ...,    0,    0,    0],
        [ 101,  103, 1005,  ...,    0,    0,    0],
        ...,
        [ 101, 2043, 1045,  ...,    0,    0,    0],
        [ 101, 1045, 1005,  ...,    0,    0,    0],
        [ 101, 2016, 8916,  ...,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'next_sentence_label': tensor([[0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [1],
        [1],
      

In [19]:
# Run bechmark BERT

m3,loss3=BERT_tuner.train_BERT(bert_inputs_transaug,epochs=5,batch_size=16,learning_rate=1e-4)

Epoch 0: 100%|████████████████████████████████████████████████████████████████| 2/2 [00:47<00:00, 23.64s/it, loss=14.9]
Epoch 1: 100%|████████████████████████████████████████████████████████████████| 2/2 [00:53<00:00, 26.67s/it, loss=9.25]
Epoch 2: 100%|████████████████████████████████████████████████████████████████| 2/2 [00:55<00:00, 27.93s/it, loss=6.79]
Epoch 3: 100%|████████████████████████████████████████████████████████████████| 2/2 [01:01<00:00, 30.72s/it, loss=5.81]
Epoch 4: 100%|████████████████████████████████████████████████████████████████| 2/2 [00:53<00:00, 26.66s/it, loss=4.82]


In [20]:
# Prepare question embeddings using newly trained model (query question)

query_matcher=QueryMatcher(m3,BERT_tuner.model_tokenizer,False,True) # Init query matcher class
question_embeddings=query_matcher.calc_question_embeddings(questions_list,save=False) # Calcuate embeddings
question_embeddings

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gedas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:04<00:00,  1.45it/s]


{'input_ids': tensor([[  101,  2064, 11498,  ...,     0,     0,     0],
        [  101,  2054,  2024,  ...,     0,     0,     0],
        [  101,  2054,  2003,  ...,     0,     0,     0],
        [  101,  2054, 11498,  ...,     0,     0,     0],
        [  101,  2043,  1996,  ...,     0,     0,     0],
        [  101,  2339,  2003,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'embeddings': tensor([[ 0.0751,  0.0975,  0.0400,  ..., -0.1474,  0.0068, -0.0318],
        [ 0.0773,  0.1062,  0.0407,  ..., -0.1609,  0.0088, -0.0353],
        [ 0.0545,  0.0808,  0.0276

In [21]:
# Run query matcher

res_1=query_matcher.match_queries(raw_text_list,prep_type='concat')
res_1

{'query_0': {'0_sentences': [('q_2', tensor(0.9991)),
   ('q_5', tensor(0.9989)),
   ('q_4', tensor(0.9988)),
   ('q_3', tensor(0.9986)),
   ('q_1', tensor(0.9981)),
   ('q_0', tensor(0.9978))],
  '1_sentences': [('q_4', tensor(0.9983)),
   ('q_2', tensor(0.9982)),
   ('q_3', tensor(0.9981)),
   ('q_1', tensor(0.9980)),
   ('q_5', tensor(0.9979)),
   ('q_0', tensor(0.9977))],
  '2_sentences': [('q_4', tensor(0.9953)),
   ('q_2', tensor(0.9950)),
   ('q_3', tensor(0.9948)),
   ('q_1', tensor(0.9947)),
   ('q_5', tensor(0.9947)),
   ('q_0', tensor(0.9944))],
  '3_sentences': [('q_4', tensor(0.9942)),
   ('q_2', tensor(0.9940)),
   ('q_3', tensor(0.9938)),
   ('q_1', tensor(0.9938)),
   ('q_5', tensor(0.9936)),
   ('q_0', tensor(0.9934))],
  '4_sentences': [('q_4', tensor(0.9904)),
   ('q_2', tensor(0.9902)),
   ('q_3', tensor(0.9898)),
   ('q_5', tensor(0.9898)),
   ('q_1', tensor(0.9897)),
   ('q_0', tensor(0.9893))],
  '5_sentences': [('q_4', tensor(0.9831)),
   ('q_2', tensor(0.9829))

##### 2.5 Back translation + synonym insertation

In [14]:
# Concatenate syn aug and trans aug

ran_sent_pairs_combaug=raw_sent_pairs_transaug+raw_sent_pairs_synaug # Combine pair data for synonym augmentation and translation augmentaion
raw_sent_notpairs_combaug=raw_sent_notpairs_transaug+raw_sent_notpairs_synaug # Combine not pair data for synonym augmentation and translation augmentaion

In [15]:
# Creat dataset for BERT with language back translation data augmentation combined with syn insertation augmentaiton

BERT_tuner=BERTtuner()
BERT_tuner.load_from_web(model_name='bert-base-uncased')

bert_inputs_combaug=BERT_tuner.prepare_data_for_BERT_train(ran_sent_pairs_combaug,raw_sent_notpairs_combaug)
bert_inputs_combaug

Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


{'input_ids': tensor([[  101,  7592,  1010,  ...,     0,     0,     0],
        [  101,  1045,  2655,  ...,     0,     0,     0],
        [  101,  1045,  1005,  ...,     0,     0,     0],
        ...,
        [  101,  2019,  2063,  ...,     0,     0,     0],
        [  101,   103, 20565,  ...,     0,     0,     0],
        [  101,  2026,   103,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'next_sentence_label': tensor([[0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [0],
        [

In [28]:
# Run bechmark BERT

m4,loss4=BERT_tuner.train_BERT(bert_inputs_combaug,epochs=5,batch_size=16,learning_rate=1e-4)

Epoch 0: 100%|████████████████████████████████████████████████████████████████| 4/4 [02:47<00:00, 41.81s/it, loss=9.52]
Epoch 1: 100%|████████████████████████████████████████████████████████████████| 4/4 [02:52<00:00, 43.11s/it, loss=5.98]
Epoch 2: 100%|████████████████████████████████████████████████████████████████| 4/4 [02:48<00:00, 42.19s/it, loss=3.12]
Epoch 3: 100%|████████████████████████████████████████████████████████████████| 4/4 [02:45<00:00, 41.45s/it, loss=1.09]
Epoch 4: 100%|███████████████████████████████████████████████████████████████| 4/4 [02:57<00:00, 44.50s/it, loss=0.501]


In [29]:
# Prepare question embeddings using newly trained model (query question)

query_matcher=QueryMatcher(m4,BERT_tuner.model_tokenizer,False,True) # Init query matcher class
question_embeddings=query_matcher.calc_question_embeddings(questions_list,save=False) # Calcuate embeddings
question_embeddings

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gedas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:03<00:00,  1.59it/s]


{'input_ids': tensor([[  101,  2064, 11498,  ...,     0,     0,     0],
        [  101,  2054,  2024,  ...,     0,     0,     0],
        [  101,  2054,  2003,  ...,     0,     0,     0],
        [  101,  2054, 11498,  ...,     0,     0,     0],
        [  101,  2043,  1996,  ...,     0,     0,     0],
        [  101,  2339,  2003,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]), 'embeddings': tensor([[ 0.0539,  0.0405,  0.0050,  ..., -0.1086,  0.0415, -0.0420],
        [ 0.0554,  0.0442,  0.0048,  ..., -0.1174,  0.0462, -0.0456],
        [ 0.0397,  0.0331,  0.0010

In [31]:
# Run query matcher

res_1=query_matcher.match_queries(raw_text_list,prep_type='concat')
res_1

{'query_0': {'0_sentences': [('q_2', tensor(0.9995)),
   ('q_5', tensor(0.9994)),
   ('q_4', tensor(0.9993)),
   ('q_3', tensor(0.9992)),
   ('q_1', tensor(0.9990)),
   ('q_0', tensor(0.9988))],
  '1_sentences': [('q_3', tensor(0.9991)),
   ('q_1', tensor(0.9991)),
   ('q_2', tensor(0.9991)),
   ('q_4', tensor(0.9990)),
   ('q_0', tensor(0.9989)),
   ('q_5', tensor(0.9989))],
  '2_sentences': [('q_1', tensor(0.9976)),
   ('q_3', tensor(0.9976)),
   ('q_0', tensor(0.9975)),
   ('q_4', tensor(0.9975)),
   ('q_2', tensor(0.9975)),
   ('q_5', tensor(0.9972))],
  '3_sentences': [('q_1', tensor(0.9971)),
   ('q_3', tensor(0.9970)),
   ('q_0', tensor(0.9969)),
   ('q_4', tensor(0.9969)),
   ('q_2', tensor(0.9969)),
   ('q_5', tensor(0.9966))],
  '4_sentences': [('q_1', tensor(0.9946)),
   ('q_3', tensor(0.9945)),
   ('q_0', tensor(0.9944)),
   ('q_2', tensor(0.9943)),
   ('q_4', tensor(0.9942)),
   ('q_5', tensor(0.9941))],
  '5_sentences': [('q_1', tensor(0.9907)),
   ('q_3', tensor(0.9907))

### 3. BERT optimization via Bayes opt

##### 0. Read labelled data

In [16]:
# Read labelled questions

data_prepper=DataPrepper()
target_data_df=data_prepper.prepare_corpus_labelled(CORPUS_LABELLED)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\gedas\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


##### 3.1 Prepare loss function

In [17]:
# Calculate top n precision (common metric in document retrieval systems)

def calculate_topn_precision(target_data:pd.DataFrame,predicted_data:pd.DataFrame,top_n:int=3):
    
    #1 Reset index
    predicted_data.reset_index(inplace=True)
    
    #2. Type conversion for merging
    predicted_data['question_id']=predicted_data['question_id'].astype(int)
    target_data['relevant_q_id']=target_data['relevant_q_id'].astype(int)

    #3. Merging
    t_p=predicted_data.merge(target_data,left_on=['question_file','question_id'],right_on=['question_file','relevant_q_id'],how='outer')
    
    #4. FFill log file id
    t_p.loc[:,'log_file'].ffill(inplace=True)
    
    #4. Labelling
    t_p['question_true_label']=np.where(t_p['question_id']==t_p['relevant_q_id'],1,0)
    
    #5. Metric function
    top_n=t_p.sort_values(['log_file','question_similarity_score'],ascending=False).groupby('log_file').head(top_n)
    precision=top_n[top_n['question_true_label']==1].shape[0]/top_n.shape[0]
    
    return precision

##### 3.2 Prepare objective function

In [35]:
# Bayes optimization function

def bayesopt_obj_function(epochs=5,batch_size=16,learning_rate=1e-4,model_type='bert-base-uncased',aggregation_type='average'):
    
    #1. Initialize variables
    epochs=int(round(epochs))
    batch_size=int(round(batch_size))
    
    #2. Train and predict
    #2.1 Prepare data (use combined augmentation as gave smallest training loss)
    BERT_tuner=BERTtuner()
    BERT_tuner.load_from_web(model_name=model_type)
    bert_inputs_combaug=BERT_tuner.prepare_data_for_BERT_train(ran_sent_pairs_combaug,raw_sent_notpairs_combaug)
    
    #2.2 Train bert
    m,loss=BERT_tuner.train_BERT(bert_inputs_combaug,epochs=epochs,batch_size=batch_size,learning_rate=learning_rate)
    
    #2.3 Get new question embeddings
    query_matcher=QueryMatcher(m,BERT_tuner.model_tokenizer,False,True) # Init query matcher class
    question_embeddings=query_matcher.calc_question_embeddings(questions_list,save=False) # Calcuate embeddings
    
    #2.4 Match queries to questions (calculate queries embeddings)
    queries_matched_dict=query_matcher.match_queries(raw_text_list,prep_type='concat')
    
    #2.5 Convert result dict to df and aggregate results
    queries_matched_dict_df=query_matcher.match_queries_to_df(queries_matched_dict,aggregation_type=aggregation_type)
    
    #2.6 Calculate top k precision
    top_n_prec=calculate_topn_precision(target_data_df,queries_matched_dict_df,top_n=3)
    
    return top_n_prec

##### 3.3 Optimize

In [36]:
# Optimization parameter dict

param_dict={'epochs':(2,10),'batch_size':(8,32),'learning_rate':(1e-6,1e-1)}

In [37]:
# Optimize

opt=BayesianOptimization(bayesopt_obj_function,param_dict,verbose=2) # Creates bayesian opt function
opt.maximize(init_points=3,n_iter=15,acq='ei') # Maximize utility function

print('-' * 100)
print('Final Results')
opt.res

|   iter    |  target   | batch_... |  epochs   | learni... |
-------------------------------------------------------------


Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


Epoch 0: 100%|████████████████████████████████████████████████████████████████| 2/2 [03:11<00:00, 95.77s/it, loss=52.9]
Epoch 1: 100%|███████████████████████████████████████████████████████████████| 2/2 [05:50<00:00, 175.47s/it, loss=30.9]
Epoch 2: 100%|███████████████████████████████████████████████████████████████| 2/2 [14:00<00:00, 420.03s/it, loss=40.1]
Epoch 3: 100%|███████████████████████████████████████████████████████████████| 2/2 [03:50<00:00, 115.47s/it, loss=39.7]
Epoch 4: 100%|███████████████████████████████████████████████████████████████| 2/2 [09:03<00:00, 271.67s/it, loss=25.3]
Epoch 5: 100%|███████████████████████████████████████████████████████████████| 2/2 [05:59<00:00, 179.69s/it, loss=43.5]
Epoch 6: 100%|███████████████████████████████████████████████████████████████| 2/2 [08:19<00:00, 249.52s/it, loss=36.1]
Epoch 7: 100%|███████████████████████████████████████████████████████████████| 2/2 [07:33<00:00, 226.82s/it, loss=42.3]
Epoch 8: 100%|██████████████████████████

| [0m 1       [0m | [0m 0.3333  [0m | [0m 29.55   [0m | [0m 9.388   [0m | [0m 0.09017 [0m |


Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


Epoch 0: 100%|████████████████████████████████████████████████████████████████| 2/2 [03:08<00:00, 94.20s/it, loss=49.9]
Epoch 1: 100%|████████████████████████████████████████████████████████████████| 2/2 [03:18<00:00, 99.46s/it, loss=45.8]
Epoch 2: 100%|███████████████████████████████████████████████████████████████| 2/2 [12:51<00:00, 386.00s/it, loss=30.6]
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:03<00:00,  1.54it/s]


| [95m 2       [0m | [95m 0.5     [0m | [95m 31.21   [0m | [95m 3.436   [0m | [95m 0.08291 [0m |


Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


Epoch 0: 100%|████████████████████████████████████████████████████████████████| 3/3 [03:03<00:00, 61.10s/it, loss=23.3]
Epoch 1: 100%|███████████████████████████████████████████████████████████████| 3/3 [06:59<00:00, 139.92s/it, loss=11.9]
Epoch 2: 100%|████████████████████████████████████████████████████████████████| 3/3 [03:29<00:00, 69.95s/it, loss=15.5]
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:03<00:00,  1.60it/s]


| [95m 3       [0m | [95m 0.6667  [0m | [95m 25.83   [0m | [95m 2.569   [0m | [95m 0.03509 [0m |


Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
Epoch 0: 100%|████████████████████████████████████████████████████████████████| 5/5 [03:01<00:00, 36.34s/it, loss=8.94]
Epoch 1: 100%|███████████████████████████████████████████████████████████████| 5/5 [11:49<00:00, 141.93s/it, loss=7.03]
Epoch 2: 100%|████████████████████████████████████████████████████████████████| 5/5 [03:28<00:00, 41.71s/it, loss=13.9]
Epoch 3: 100%|████████████████████████████████████████████████████████████████| 5/5 [03:29<00:00, 41.83s/it, loss=8.21]
Epoch 4: 100%|██████████████████████████████████████████████████████████████████| 5/5 [03:32<00:00, 42.42s/it, loss=12]
Epoch 5: 100%|████████████████████████████████████████████████████████████████| 5/5 [03:33<00:00, 42.68s/it, loss=11.7]
Epoch 6: 100%|████████████████████████████████████████████████████████████████| 5/5 [03:34<00:00, 42.98s/it, loss=9.94]
Epoch 7: 100%|████████████████████████████████████████████████████████████████| 5/

| [0m 4       [0m | [0m 0.5     [0m | [0m 14.08   [0m | [0m 7.877   [0m | [0m 0.02062 [0m |


Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
Epoch 0: 100%|████████████████████████████████████████████████████████████████| 4/4 [02:51<00:00, 42.94s/it, loss=34.2]
Epoch 1: 100%|███████████████████████████████████████████████████████████████| 4/4 [08:11<00:00, 122.89s/it, loss=33.1]
Epoch 2: 100%|████████████████████████████████████████████████████████████████| 4/4 [03:21<00:00, 50.28s/it, loss=23.5]
Epoch 3: 100%|████████████████████████████████████████████████████████████████| 4/4 [03:24<00:00, 51.23s/it, loss=31.5]
Epoch 4: 100%|████████████████████████████████████████████████████████████████| 4/4 [03:16<00:00, 49.15s/it, loss=20.5]
Epoch 5: 100%|████████████████████████████████████████████████████████████████| 4/4 [03:20<00:00, 50.15s/it, loss=20.1]
Epoch 6: 100%|████████████████████████████████████████████████████████████████| 4/4 [03:31<00:00, 52.87s/it, loss=26.8]
Epoch 7: 100%|████████████████████████████████████████████████████████████████| 4/

| [95m 5       [0m | [95m 0.75    [0m | [95m 16.58   [0m | [95m 9.223   [0m | [95m 0.05042 [0m |


Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
Epoch 0: 100%|████████████████████████████████████████████████████████████████| 4/4 [02:41<00:00, 40.34s/it, loss=9.77]
Epoch 1: 100%|████████████████████████████████████████████████████████████████| 4/4 [02:48<00:00, 42.20s/it, loss=4.54]
Epoch 2: 100%|████████████████████████████████████████████████████████████████| 4/4 [02:46<00:00, 41.66s/it, loss=2.05]
Epoch 3: 100%|████████████████████████████████████████████████████████████████| 4/4 [02:56<00:00, 44.11s/it, loss=1.79]
Epoch 4: 100%|████████████████████████████████████████████████████████████████| 4/4 [03:09<00:00, 47.39s/it, loss=1.61]
Epoch 5: 100%|████████████████████████████████████████████████████████████████| 4/4 [03:42<00:00, 55.63s/it, loss=1.64]
Epoch 6: 100%|████████████████████████████████████████████████████████████████| 4/4 [04:05<00:00, 61.37s/it, loss=1.44]
Epoch 7: 100%|█████████████████████████████████████████████████████████████████| 4

| [95m 6       [0m | [95m 1.0     [0m | [95m 14.55   [0m | [95m 9.021   [0m | [95m 0.006341[0m |


Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
Epoch 0: 100%|█████████████████████████████████████████████████████████████████| 3/3 [03:04<00:00, 61.36s/it, loss=145]
Epoch 1: 100%|███████████████████████████████████████████████████████████████| 3/3 [10:49<00:00, 216.39s/it, loss=88.8]
Epoch 2: 100%|███████████████████████████████████████████████████████████████| 3/3 [06:33<00:00, 131.12s/it, loss=41.6]
Epoch 3: 100%|███████████████████████████████████████████████████████████████| 3/3 [18:15<00:00, 365.29s/it, loss=51.7]
Epoch 4: 100%|███████████████████████████████████████████████████████████████| 3/3 [07:38<00:00, 152.94s/it, loss=53.2]
Epoch 5: 100%|███████████████████████████████████████████████████████████████| 3/3 [15:10<00:00, 303.62s/it, loss=34.2]
Epoch 6: 100%|███████████████████████████████████████████████████████████████| 3/3 [16:20<00:00, 326.86s/it, loss=42.1]
Epoch 7: 100%|███████████████████████████████████████████████████████████████| 3/3

| [0m 7       [0m | [0m 0.5     [0m | [0m 26.46   [0m | [0m 9.367   [0m | [0m 0.09626 [0m |


Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
Epoch 0: 100%|███████████████████████████████████████████████████████████████| 5/5 [08:51<00:00, 106.34s/it, loss=17.2]
Epoch 1: 100%|███████████████████████████████████████████████████████████████| 5/5 [09:27<00:00, 113.41s/it, loss=15.2]
Epoch 2: 100%|████████████████████████████████████████████████████████████████| 5/5 [05:10<00:00, 62.17s/it, loss=13.1]
Epoch 3: 100%|████████████████████████████████████████████████████████████████| 5/5 [04:47<00:00, 57.50s/it, loss=18.1]
Epoch 4: 100%|████████████████████████████████████████████████████████████████| 5/5 [05:02<00:00, 60.53s/it, loss=17.3]
Epoch 5: 100%|████████████████████████████████████████████████████████████████| 5/5 [04:30<00:00, 54.17s/it, loss=8.96]
Epoch 6: 100%|████████████████████████████████████████████████████████████████| 5/5 [04:07<00:00, 49.42s/it, loss=12.4]
Epoch 7: 100%|████████████████████████████████████████████████████████████████| 5/

| [0m 8       [0m | [0m 0.6667  [0m | [0m 13.59   [0m | [0m 8.338   [0m | [0m 0.02299 [0m |


Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
Epoch 0: 100%|████████████████████████████████████████████████████████████████| 4/4 [03:01<00:00, 45.41s/it, loss=75.9]
Epoch 1: 100%|█████████████████████████████████████████████████████████████████| 4/4 [22:58<00:00, 344.56s/it, loss=41]
Epoch 2: 100%|████████████████████████████████████████████████████████████████| 4/4 [03:45<00:00, 56.27s/it, loss=46.4]
Epoch 3: 100%|████████████████████████████████████████████████████████████████| 4/4 [05:08<00:00, 77.22s/it, loss=22.1]
Epoch 4: 100%|████████████████████████████████████████████████████████████████| 4/4 [03:53<00:00, 58.28s/it, loss=22.5]
Epoch 5: 100%|████████████████████████████████████████████████████████████████| 4/4 [03:42<00:00, 55.73s/it, loss=17.9]
Epoch 6: 100%|████████████████████████████████████████████████████████████████| 4/4 [03:40<00:00, 55.03s/it, loss=19.9]
100%|█████████████████████████████████████████████████████████████████████████████

| [0m 9       [0m | [0m 1.0     [0m | [0m 16.82   [0m | [0m 6.893   [0m | [0m 0.04229 [0m |


Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
Epoch 0: 100%|████████████████████████████████████████████████████████████████| 6/6 [06:42<00:00, 67.02s/it, loss=24.2]
Epoch 1: 100%|████████████████████████████████████████████████████████████████| 6/6 [09:17<00:00, 92.89s/it, loss=37.3]
Epoch 2: 100%|████████████████████████████████████████████████████████████████| 6/6 [02:45<00:00, 27.63s/it, loss=26.9]
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:03<00:00,  1.60it/s]


| [0m 10      [0m | [0m 0.6667  [0m | [0m 11.38   [0m | [0m 3.37    [0m | [0m 0.03928 [0m |


Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
Epoch 0: 100%|███████████████████████████████████████████████████████████████| 4/4 [08:00<00:00, 120.14s/it, loss=34.7]
Epoch 1: 100%|████████████████████████████████████████████████████████████████| 4/4 [03:26<00:00, 51.68s/it, loss=35.4]
Epoch 2: 100%|████████████████████████████████████████████████████████████████| 4/4 [03:33<00:00, 53.44s/it, loss=23.7]
Epoch 3: 100%|████████████████████████████████████████████████████████████████| 4/4 [03:42<00:00, 55.70s/it, loss=22.5]
Epoch 4: 100%|████████████████████████████████████████████████████████████████| 4/4 [03:36<00:00, 54.14s/it, loss=33.9]
Epoch 5: 100%|██████████████████████████████████████████████████████████████████| 4/4 [03:56<00:00, 59.01s/it, loss=21]
Epoch 6: 100%|████████████████████████████████████████████████████████████████| 4/4 [03:51<00:00, 57.87s/it, loss=26.9]
100%|█████████████████████████████████████████████████████████████████████████████

| [0m 11      [0m | [0m 0.5     [0m | [0m 16.87   [0m | [0m 6.918   [0m | [0m 0.03972 [0m |


Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
Epoch 0: 100%|████████████████████████████████████████████████████████████████| 3/3 [03:29<00:00, 69.71s/it, loss=57.6]
Epoch 1: 100%|███████████████████████████████████████████████████████████████| 3/3 [18:36<00:00, 372.30s/it, loss=21.8]
Epoch 2: 100%|████████████████████████████████████████████████████████████████| 3/3 [04:23<00:00, 87.80s/it, loss=41.8]
Epoch 3: 100%|███████████████████████████████████████████████████████████████| 3/3 [13:53<00:00, 277.74s/it, loss=38.3]
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:04<00:00,  1.24it/s]


| [0m 12      [0m | [0m 0.75    [0m | [0m 19.62   [0m | [0m 4.057   [0m | [0m 0.07224 [0m |


Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
Epoch 0: 100%|████████████████████████████████████████████████████████████████| 6/6 [09:47<00:00, 97.85s/it, loss=30.8]
Epoch 1: 100%|████████████████████████████████████████████████████████████████| 6/6 [03:00<00:00, 30.12s/it, loss=20.4]
Epoch 2: 100%|████████████████████████████████████████████████████████████████| 6/6 [02:47<00:00, 27.98s/it, loss=20.4]
Epoch 3: 100%|████████████████████████████████████████████████████████████████| 6/6 [02:46<00:00, 27.71s/it, loss=29.1]
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:03<00:00,  1.65it/s]


| [0m 13      [0m | [0m 0.5     [0m | [0m 10.12   [0m | [0m 4.047   [0m | [0m 0.04269 [0m |


Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
Epoch 0: 100%|█████████████████████████████████████████████████████████████████| 5/5 [09:58<00:00, 119.67s/it, loss=13]
Epoch 1: 100%|████████████████████████████████████████████████████████████████| 5/5 [03:04<00:00, 36.83s/it, loss=52.4]
Epoch 2: 100%|███████████████████████████████████████████████████████████████| 5/5 [15:05<00:00, 181.18s/it, loss=21.2]
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:03<00:00,  1.64it/s]


| [0m 14      [0m | [0m 0.3333  [0m | [0m 13.29   [0m | [0m 2.854   [0m | [0m 0.05531 [0m |


Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
Epoch 0: 100%|████████████████████████████████████████████████████████████████| 2/2 [03:15<00:00, 97.53s/it, loss=24.4]
Epoch 1: 100%|███████████████████████████████████████████████████████████████| 2/2 [17:46<00:00, 533.38s/it, loss=39.9]
Epoch 2: 100%|███████████████████████████████████████████████████████████████| 2/2 [05:56<00:00, 178.31s/it, loss=24.8]
Epoch 3: 100%|███████████████████████████████████████████████████████████████| 2/2 [04:01<00:00, 120.63s/it, loss=54.3]
Epoch 4: 100%|███████████████████████████████████████████████████████████████| 2/2 [05:40<00:00, 170.14s/it, loss=22.9]
Epoch 5: 100%|█████████████████████████████████████████████████████████████████| 2/2 [04:02<00:00, 121.31s/it, loss=34]
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:03<00:00,  1.57it/s]


| [0m 15      [0m | [0m 0.6     [0m | [0m 30.1    [0m | [0m 6.225   [0m | [0m 0.05285 [0m |


Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
Epoch 0: 100%|██████████████████████████████████████████████████████████████████| 3/3 [03:22<00:00, 67.39s/it, loss=38]
Epoch 1: 100%|███████████████████████████████████████████████████████████████| 3/3 [12:51<00:00, 257.08s/it, loss=18.5]
Epoch 2: 100%|███████████████████████████████████████████████████████████████| 3/3 [12:48<00:00, 256.02s/it, loss=24.3]
Epoch 3: 100%|████████████████████████████████████████████████████████████████| 3/3 [03:44<00:00, 74.87s/it, loss=24.7]
Epoch 4: 100%|████████████████████████████████████████████████████████████████| 3/3 [03:43<00:00, 74.62s/it, loss=16.2]
Epoch 5: 100%|██████████████████████████████████████████████████████████████████| 3/3 [03:36<00:00, 72.19s/it, loss=17]
Epoch 6: 100%|████████████████████████████████████████████████████████████████| 3/3 [03:41<00:00, 73.91s/it, loss=13.2]
100%|█████████████████████████████████████████████████████████████████████████████

| [0m 16      [0m | [0m 0.5     [0m | [0m 23.03   [0m | [0m 6.728   [0m | [0m 0.03767 [0m |


Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
Epoch 0: 100%|████████████████████████████████████████████████████████████████| 3/3 [03:44<00:00, 74.89s/it, loss=29.5]
Epoch 1: 100%|███████████████████████████████████████████████████████████████| 3/3 [15:02<00:00, 300.73s/it, loss=19.6]
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:03<00:00,  1.62it/s]


| [0m 17      [0m | [0m 0.5     [0m | [0m 19.65   [0m | [0m 2.285   [0m | [0m 0.03762 [0m |


Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


  return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
Epoch 0: 100%|████████████████████████████████████████████████████████████████| 3/3 [03:34<00:00, 71.59s/it, loss=52.4]
Epoch 1: 100%|████████████████████████████████████████████████████████████████| 3/3 [04:05<00:00, 81.84s/it, loss=75.5]
Epoch 2: 100%|████████████████████████████████████████████████████████████████| 3/3 [03:43<00:00, 74.44s/it, loss=49.2]
Epoch 3: 100%|█████████████████████████████████████████████████████████████████| 3/3 [16:14<00:00, 324.91s/it, loss=58]
Epoch 4: 100%|████████████████████████████████████████████████████████████████| 3/3 [04:35<00:00, 91.94s/it, loss=56.9]
Epoch 5: 100%|███████████████████████████████████████████████████████████████| 3/3 [08:46<00:00, 175.51s/it, loss=43.5]
Epoch 6: 100%|███████████████████████████████████████████████████████████████| 3/3 [06:03<00:00, 121.05s/it, loss=36.1]
Epoch 7: 100%|████████████████████████████████████████████████████████████████| 3/

| [0m 18      [0m | [0m 0.6667  [0m | [0m 25.29   [0m | [0m 8.747   [0m | [0m 0.07063 [0m |
----------------------------------------------------------------------------------------------------
Final Results


[{'target': 0.3333333333333333,
  'params': {'batch_size': 29.553730742713505,
   'epochs': 9.38814220981831,
   'learning_rate': 0.09016813746748065}},
 {'target': 0.5,
  'params': {'batch_size': 31.21126164807619,
   'epochs': 3.4360530485231484,
   'learning_rate': 0.08290607763622375}},
 {'target': 0.6666666666666666,
  'params': {'batch_size': 25.82886959124032,
   'epochs': 2.568654217598951,
   'learning_rate': 0.03509096549672737}},
 {'target': 0.5,
  'params': {'batch_size': 14.082918238288205,
   'epochs': 7.8770800943546,
   'learning_rate': 0.02062272812911489}},
 {'target': 0.75,
  'params': {'batch_size': 16.58486990317661,
   'epochs': 9.22325344967566,
   'learning_rate': 0.050416789012716325}},
 {'target': 1.0,
  'params': {'batch_size': 14.546609252900712,
   'epochs': 9.020573691174002,
   'learning_rate': 0.00634060305118127}},
 {'target': 0.5,
  'params': {'batch_size': 26.46036422764427,
   'epochs': 9.366552274643832,
   'learning_rate': 0.09626301422230656}},
 {

In [38]:
opt.max

{'target': 1.0,
 'params': {'batch_size': 14.546609252900712,
  'epochs': 9.020573691174002,
  'learning_rate': 0.00634060305118127}}

### 4. Train and save best model

##### 4.1 Train best model

In [None]:
# Train best model

# Prepare data
BERT_tuner=BERTtuner()
BERT_tuner.load_from_web(model_name='bert-base-uncased')

bert_inputs_combaug=BERT_tuner.prepare_data_for_BERT_train(ran_sent_pairs_combaug,raw_sent_notpairs_combaug)
bert_inputs_combaug

# Train
m,loss=BERT_tuner.train_BERT(bert_inputs_combaug,epochs=9,batch_size=16,learning_rate=0.006)

Some weights of BertForPreTraining were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Model bert-base-uncased had been loaded.


Epoch 0: 100%|████████████████████████████████████████████████████████████████| 4/4 [02:35<00:00, 38.75s/it, loss=19.6]
Epoch 1: 100%|████████████████████████████████████████████████████████████████| 4/4 [02:46<00:00, 41.65s/it, loss=1.76]
Epoch 2: 100%|████████████████████████████████████████████████████████████████| 4/4 [02:47<00:00, 41.97s/it, loss=1.64]
Epoch 3: 100%|████████████████████████████████████████████████████████████████| 4/4 [02:56<00:00, 44.01s/it, loss=1.59]
Epoch 4: 100%|████████████████████████████████████████████████████████████████| 4/4 [03:06<00:00, 46.54s/it, loss=1.47]
Epoch 5: 100%|████████████████████████████████████████████████████████████████| 4/4 [03:14<00:00, 48.66s/it, loss=1.46]
Epoch 6:  50%|████████████████████████████████                                | 2/4 [01:48<01:48, 54.33s/it, loss=1.54]

In [None]:
# Save embeddings

query_matcher=QueryMatcher(m,BERT_tuner.model_tokenizer,False,True) # Init query matcher class
question_embeddings=query_matcher.calc_question_embeddings(questions_list,save=True) # Calcuate embeddings

##### 4.2 Save best model and tokenizer

In [None]:
# Save model

torch.save(m, MODEL_PATH.joinpath('best_model'))
torch.save(query_matcher.tokenizer, MODEL_PATH.joinpath('best_model_t'))