<a href="https://colab.research.google.com/github/isaacmg/common/blob/sentence_notebook/Sentence%20Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training Sentence Transformers for COVID-19
This is a basic notebook for training sentence transformers on the MedNLI and SciTail datasets. It was run on Google Colaboratory Pro using a single GPU. For a more general tutorial on why sentence embeddings might be useful, [see here](https://www.kaggle.com/isaacmg/scibert-embeddings). Training times were generally under two hours (except when training on AllNLI). The finalized weights can be found on Kaggle (see the link in Coronawhy). Note that since there is currently no way to validate results concretely, there are several different sets of weights you can use. Sorry for the somewhat messy code: I only had access to GPUs on Colab, so I didn't bother refactoring it into scripts and methods.

In [0]:
!git clone https://github.com/UKPLab/sentence-transformers
import os 
import json
from google.colab import auth
auth.authenticate_user()

Cloning into 'sentence-transformers'...
remote: Enumerating objects: 15, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 729 (delta 8), reused 3 (delta 0), pack-reused 714
Receiving objects: 100% (729/729), 208.14 KiB | 379.00 KiB/s, done.
Resolving deltas: 100% (490/490), done.


In [0]:
os.chdir('sentence-transformers')
!pip install -r requirements.txt
!python examples/datasets/get_data.py

Extract wikipedia-sections-triplets.zip
All datasets downloaded and extracted


In [0]:
from torch.utils.data import DataLoader
import math
from sentence_transformers import models, losses
from sentence_transformers import SentencesDataset, LoggingHandler, SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.readers import *
import logging
from datetime import datetime

#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
# Download relevant data to directories
!mkdir data 
!gsutil cp gs://qa_research/scitail_1.0_train.txt data/scitail_1.0_train.jsonl
!gsutil cp gs://qa_research/mli_train_v1.jsonl data/mli_train_v1.jsonl

Copying gs://qa_research/scitail_1.0_train.txt...
\ [1 files][ 27.2 MiB/ 27.2 MiB]                                                
Operation completed over 1 objects/27.2 MiB.                                     
Copying gs://qa_research/mli_train_v1.jsonl...
- [1 files][ 10.5 MiB/ 10.5 MiB]                                                
Operation completed over 1 objects/10.5 MiB.                                     


## Pre-training on AllNLI
Even though the AllNLI dataset isn't scientific, it can still form the basis of useful pre-training.

In [0]:
# We will use the recent model by deepset as the base LM
model_name = "deepset/covid_bert_base"
model_save_path = 'pretrained_'+model_name+'-'+datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

# Use BERT for mapping tokens to embeddings
word_embedding_model = models.BERT(model_name)

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])


# Initialize the AllNLI training data
logging.info("Read AllNLI train dataset")
batch_size = 16
nli_reader = NLIDataReader("examples/datasets/AllNLI")
train_data_nli = SentencesDataset(nli_reader.get_examples('train.gz'), model=model)
train_dataloader_nli = DataLoader(train_data_nli, shuffle=True, batch_size=batch_size)
train_loss_nli = losses.SoftmaxLoss(model=model, sentence_embedding_dimension=model.get_sentence_embedding_dimension(), num_labels=nli_reader.get_num_labels())
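
The pooling module configured above turns a variable number of token embeddings into one fixed-size sentence vector by averaging them. Here is a minimal pure-Python sketch of that idea, assuming padded positions are excluded through an attention mask. The function name `mean_pool` and the toy values are illustrative only and are not part of sentence-transformers.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    # Guard against a sequence that consists only of padding.
    count = max(count, 1)
    return [total / count for total in summed]


# Toy example (made-up numbers): two real tokens and one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
```

The padding vector does not influence the result, which is why sentences of different lengths can share a batch.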


In [0]:
# We will use STSDataReader for all of the evaluation 
sts_reader = STSDataReader('examples/datasets/stsbenchmark')
dev_data = SentencesDataset(examples=sts_reader.get_examples('sts-dev.csv'), model=model)
dev_dataloader = DataLoader(dev_data, shuffle=False, batch_size=batch_size)
evaluator = EmbeddingSimilarityEvaluator(dev_dataloader)

Convert dataset: 100%|██████████| 1500/1500 [00:00<00:00, 1520.96it/s]

2020-03-30 19:19:23 - Num sentences: 1500
2020-03-30 19:19:23 - Sentences 0 longer than max_seqence_length: 0
2020-03-30 19:19:23 - Sentences 1 longer than max_seqence_length: 0





In [0]:
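# len(train_dataloader_nli) is already the number of batches per epoch,
# so 10% of the training steps (1 epoch here) are used for warm-up
warmup_steps = math.ceil(len(train_dataloader_nli) * 1 * 0.1)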
model.fit(train_objectives=[(train_dataloader_nli, train_loss_nli)],
          evaluator=evaluator,
          epochs=1,
          evaluation_steps=1000,
          warmup_steps=warmup_steps,
          output_path=model_save_path)
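
To make the warm-up setting concrete: `len()` of a `DataLoader` counts batches, so the number of optimizer steps is batches per epoch times epochs, and the warm-up covers a fraction of those. The sketch below assumes a linear warm-up followed by a linear decay. The helper names `warmup_steps_for` and `lr_scale` are hypothetical, and the scheduler actually used by `model.fit` may differ.

```python
import math


def warmup_steps_for(num_batches, epochs, fraction=0.1):
    """Number of warm-up steps as a fraction of all training steps."""
    return math.ceil(num_batches * epochs * fraction)


def lr_scale(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Linear ramp from 0 up to 1 during warm-up.
        return step / max(1, warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


total = 100   # e.g. 50 batches per epoch for 2 epochs (illustrative numbers)
warmup = 10
for step in (0, 5, 10, 55, 100):
    print(step, lr_scale(step, warmup, total))
```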

## Part II: Training on scientific datasets

In [0]:
import json
from sentence_transformers.readers import InputExample

class MedNLIReader(object):
    """
    Reads NLI examples from a JSON Lines file (used here for both MedNLI and SciTail).
    Each line is a JSON object with the keys "sentence1", "sentence2" and "gold_label".
    """
    def __init__(self, dataset_path):
        # Path to the .jsonl file itself (not a folder)
        self.json_path = dataset_path

    def get_examples(self, filename, max_examples=0):
        """
        Reads the examples from self.json_path.
        filename is only used as a prefix for the example guids.
        max_examples=0 reads the whole file.
        """
        examples = []
        example_id = 0
        with open(self.json_path) as f:
            for line in f:
                line_final = json.loads(line)
                sentence_a = line_final["sentence1"]
                sentence_b = line_final["sentence2"]
                label = line_final["gold_label"]
                guid = "%s-%d" % (filename, example_id)
                example_id += 1
                examples.append(InputExample(guid=guid, texts=[sentence_a, sentence_b], label=self.map_label(label)))

                # Stop early once max_examples have been collected
                if 0 < max_examples <= len(examples):
                    break

        return examples

    @staticmethod
    def get_labels():
        return {"contradiction": 0, "entailment": 1, "neutral": 2}

    def get_num_labels(self):
        return len(self.get_labels())

    def map_label(self, label):
        # Normalize whitespace and case before looking up the class index
        return self.get_labels()[label.strip().lower()]
# Read the dataset
nli_reader = MedNLIReader('data/scitail_1.0_train.jsonl')
nli_reader_mednli = MedNLIReader('data/mli_train_v1.jsonl')
train_num_labels = nli_reader.get_num_labels()
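
As a quick illustration of the file format the reader expects, here is a small standalone sketch that parses JSON Lines records into `(sentence1, sentence2, label_id)` tuples. The two sample records are made up for the example and do not come from MedNLI or SciTail. `parse_nli_lines` is a hypothetical helper that mirrors the reader's logic without depending on sentence-transformers.

```python
import json

LABELS = {"contradiction": 0, "entailment": 1, "neutral": 2}


def parse_nli_lines(lines, max_examples=0):
    """Turn JSON Lines records into (sentence1, sentence2, label_id) tuples."""
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        label_id = LABELS[record["gold_label"].strip().lower()]
        examples.append((record["sentence1"], record["sentence2"], label_id))
        if 0 < max_examples <= len(examples):
            break
    return examples


# Made-up records in the assumed format.
sample_lines = [
    '{"sentence1": "The patient has a fever.", "sentence2": "The patient is febrile.", "gold_label": "entailment"}',
    '{"sentence1": "The patient has a fever.", "sentence2": "The patient is afebrile.", "gold_label": "Contradiction "}',
]
parsed = parse_nli_lines(sample_lines)
print(parsed)
```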

## Train on SciTail

In [0]:
# Convert the dataset to a DataLoader ready for training
logging.info("Read SCITail Dataset")
train_data = SentencesDataset(nli_reader.get_examples('scitail_1.0_train.jsonl'), model=model)
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
train_loss = losses.SoftmaxLoss(model=model, sentence_embedding_dimension=model.get_sentence_embedding_dimension(), num_labels=train_num_labels)
logging.info("Read STSbenchmark dev dataset")


# Configure the training
num_epochs = 2

# len(train_dataloader) already counts batches, so no further division by batch_size
warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1)  # 10% of training steps for warm-up
logging.info("Warmup-steps: {}".format(warmup_steps))



# Train the model
model.fit(train_objectives=[(train_dataloader, train_loss)],
          evaluator=evaluator,
          epochs=num_epochs,
          evaluation_steps=1000,
          warmup_steps=warmup_steps,
          output_path=model_save_path
          )



##############################################################################
#
# Load the stored model and evaluate its performance on STS benchmark dataset
#
##############################################################################

model = SentenceTransformer(model_save_path)
test_data = SentencesDataset(examples=sts_reader.get_examples("sts-test.csv"), model=model)
test_dataloader = DataLoader(test_data, shuffle=False, batch_size=batch_size)
evaluator = EmbeddingSimilarityEvaluator(test_dataloader)

model.evaluate(evaluator)
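
The STS evaluation compares the cosine similarity of each sentence pair's embeddings with the human-annotated gold score, typically summarized as a rank correlation. Below is a rough pure-Python sketch of a Spearman correlation under that assumption. The helper names and toy scores are illustrative, and this is not the library's own implementation.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: gold scores on a 0-5 scale and made-up model similarities.
gold_scores = [0.0, 2.5, 5.0, 4.0]
model_similarities = [0.1, 0.4, 0.9, 0.8]
print(spearman(gold_scores, model_similarities))
```

Because only the ordering matters, the model's similarities do not need to be on the same scale as the gold scores.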

## Train on MedNLI

In [0]:
logging.info("Read MedNLI Dataset")
train_data_med = SentencesDataset(nli_reader_mednli.get_examples('mli_train_v1.jsonl'), model=model)
train_dataloader_med = DataLoader(train_data_med, shuffle=True, batch_size=batch_size)

In [0]:
train_num_labels_med = nli_reader_mednli.get_num_labels()
train_loss_med = losses.SoftmaxLoss(model=model, sentence_embedding_dimension=model.get_sentence_embedding_dimension(), num_labels=train_num_labels_med)
model.fit(train_objectives=[(train_dataloader_med, train_loss_med)],
          evaluator=evaluator,
          epochs=num_epochs,
          evaluation_steps=1000,
          warmup_steps=warmup_steps,
          output_path="model_results2"
          )
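
For intuition about the classification objective used in the training cells: as far as I understand the default configuration of `losses.SoftmaxLoss`, the two sentence embeddings `u` and `v` are combined into the feature vector `[u, v, |u - v|]`, passed through a linear layer, and scored with cross-entropy. The sketch below illustrates that computation with plain lists. The helper names and the all-zero weights are assumptions for the example, not the library's code.

```python
import math


def pair_features(u, v):
    """Concatenate u, v and their element-wise absolute difference."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]


def linear(features, weights, bias):
    """One logit per class: dot(weights[c], features) + bias[c]."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def softmax_cross_entropy(logits, label):
    """Negative log-probability of the correct class (numerically stable)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[label]


u = [1.0, 0.0]
v = [0.0, 1.0]
features = pair_features(u, v)          # length 3 * embedding dimension
weights = [[0.0] * len(features) for _ in range(3)]  # 3 NLI classes, untrained
bias = [0.0, 0.0, 0.0]
loss = softmax_cross_entropy(linear(features, weights, bias), label=1)
print(features, loss)
```

With untrained (all-zero) weights every class is equally likely, so the loss equals `log(3)`; training lowers it by shaping the embeddings so that the classifier can separate the labels.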

In [0]:
!gsutil cp -r model_results2 gs://coronaviruspublicdata/today/model_results

Copying file://model_results2/modules.json [Content-Type=application/json]...
Copying file://model_results2/similarity_evaluation_results.csv [Content-Type=text/csv]...
Copying file://model_results2/config.json [Content-Type=application/json]...
Copying file://model_results2/1_Pooling/config.json [Content-Type=application/json]...
\
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying file://model_results2/0_BERT/sentence_bert_config.json [Content-Type=application/json]...
Copying file://model_results2/0_BERT/added_tokens.json [Content-Type=application/json]...
Copying file://model_results2/0_BERT/pytorch_model.bin [Content-Type=application/octet-stream]...
==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite up

In [0]:
!gsutil cp -r model_results2 gs://coronaviruspublicdata

## Qualitative Evaluation

In [0]:
#!pip install scikit_learn
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# The second abstract is split across two adjacent string literals, which Python joins into one string
sents = model.encode(["The SARS 3C-like proteinase (SARS-3CLpro), which is the main proteinase of the SARS coronavirus, is essential to the virus life cycle. This enzyme has been shown to be active as a dimer in which only one protomer is active. However, it remains unknown how the dimer structure maintains an active monomer conformation. It has been observed that the Ser139-Leu141 loop forms a short 3 10 -helix that disrupts the catalytic machinery in the inactive monomer structure. We have tried to disrupt this helical conformation by mutating L141 to T in the stable inactive monomer G11A/R298A/Q299A. The resulting tetra-mutant G11A/L141T/R298A/Q299A is indeed enzymatically active as a monomer. Molecular dynamics simulations revealed that the L141T mutation disrupts the 3 10 -helix and helps to stabilize the active conformation. The coil-3 10 -helix conformational transition of the Ser139-Leu141 loop serves as an enzyme activity switch. Our study therefore indicates that the dimer structure can stabilize the active conformation but is not a required structure in the evolution of the active enzyme, which can also arise through simple mutations", "DNA sequences seen in the normal character-based representation appear to have a formidable mixing of the four nucleotides without any apparent order. Nucleotide frequencies and distributions in the sequences have been studied extensively, since the simple rule given by Chargaff almost a century ago that equates the total number of purines to the pyrimidines in a duplex DNA sequence. While it is difficult to trace any relationship between the bases from studies in the character representation of a DNA sequence, graphical representations may provide a clue. These novel representations of DNA sequences have been useful in providing an overview of base distribution and composition of the sequences and providing insights into many hidden structures. "
                      "We report here our observation based on a graphical representation that the intra-purine and intra-pyrimidine differences in sequences of conserved genes generally follow a quadratic distribution relationship and show that this may have arisen from mutations in the sequences over evolutionary time scales. From this hitherto undescribed relationship for the gene sequences considered in this report we hypothesize that such relationships may be characteristic of these sequences and therefore could become a barrier to large scale sequence alterations that override such characteristics, perhaps through some monitoring process inbuilt in the DNA sequences. Such relationship also raises the possibility of intron sequences playing an important role in maintaining the characteristics and could be indicative of possible intron-late phenomena"])
#query = model.encode(["chrlorquine COVID-2019 treatment"])
cosine_similarity(np.expand_dims(sents[0], axis=0), np.expand_dims(sents[1], axis=0))
#scipy.spatial.distance.cdist(sents[0], sents[1], "cosine")[0]



Batches: 100%|██████████| 1/1 [00:00<00:00, 22.78it/s]


array([[0.825353]], dtype=float32)

In [0]:
from sklearn.metrics.pairwise import cosine_similarity
sents = model.encode(["coronavirus", "mers", "sars", "covid-19", "dog"])
print(cosine_similarity(np.expand_dims(sents[0], axis=0), np.expand_dims(sents[1], axis=0)))
print(cosine_similarity(np.expand_dims(sents[0], axis=0), np.expand_dims(sents[2], axis=0)))
print(cosine_similarity(np.expand_dims(sents[0], axis=0), np.expand_dims(sents[3], axis=0)))
print(cosine_similarity(np.expand_dims(sents[0], axis=0), np.expand_dims(sents[4], axis=0)))



Batches: 100%|██████████| 1/1 [00:00<00:00, 32.88it/s]

[[0.39761117]]
[[0.6937201]]
[[0.6606833]]
[[0.1408183]]
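
The comparisons above all measure cosine similarity between the first embedding and each of the others. For reference, here is a small pure-Python sketch of that computation. The helper names and the three-dimensional toy vectors are illustrative only; real sentence embeddings from the model have far more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|), or 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compare_to_first(labels, vectors):
    """Similarity of the first vector to every other vector, keyed by label."""
    anchor = vectors[0]
    return {label: cosine(anchor, vector)
            for label, vector in zip(labels[1:], vectors[1:])}


# Made-up vectors standing in for model.encode(...) output.
labels = ["coronavirus", "sars", "dog"]
vectors = [[1.0, 1.0, 0.0], [1.0, 0.5, 0.0], [0.0, 0.0, 1.0]]
for label, score in compare_to_first(labels, vectors).items():
    print(label, round(score, 3))
```

A score near 1 means the vectors point in almost the same direction, while a score near 0 means they are close to orthogonal.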





In [0]:
#!pip install scikit_learn
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
sents = model.encode(["Dogs eating food", "coronavirus treatment", "chrlorquine COVID-2019 treatment"])
cosine_similarity(np.expand_dims(sents[1], axis=0), np.expand_dims(sents[2], axis=0))
#scipy.spatial.distance.cdist(sents[0], sents[1], "cosine")[0]



Batches: 100%|██████████| 1/1 [00:00<00:00, 33.80it/s]


array([[0.84475654]], dtype=float32)

In [0]:
new_sents = model.encode(["treatment efffiacy of chrlorquine on COVID patients", "bat to human transmission mechanism of coronaviruses"])



Batches: 100%|██████████| 1/1 [00:00<00:00, 26.25it/s]


In [0]:
cosine_similarity(np.expand_dims(new_sents[0], axis=0), np.expand_dims(new_sents[1], axis=0))

array([[0.5759958]], dtype=float32)

In [0]:
new_sents = model.encode(["bat to human transmission mechanism of coronaviruses", "camel to human transmission mechanism of middle east respiratory viruses"])



Batches: 100%|██████████| 1/1 [00:00<00:00, 31.36it/s]


In [0]:
cosine_similarity(np.expand_dims(new_sents[0], axis=0), np.expand_dims(new_sents[1], axis=0))

array([[0.8909222]], dtype=float32)