# BERT 101: Fine tuning
* Objective: Learn how to fine-tune https://huggingface.co/pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb for the gene-phenotype classifier task.  
TODO: how it works, dataset, evaluation, training.  
* Sources: https://www.sbert.net/index.html  
 
## 1. Get an embedding

In [1]:
from sentence_transformers import SentenceTransformer
import numpy as np

BERTBASE =  'sentence-transformers/stsb-bert-base'
PRITAMDEKAMODEL = 'pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb'
bertmodel = SentenceTransformer(PRITAMDEKAMODEL)
model = bertmodel

# Sentences we want to encode. Example:
sentence = ['This framework generates embeddings for each input sentence']

# Sentences are encoded by calling model.encode()
embedding = model.encode(sentence)
print(embedding)
print("N sentences=", len(embedding))  # encode() returns one row per input sentence, not per token
print("Length=", len(embedding[0]))  # embedding dimension
print("Norm=", np.linalg.norm(embedding))

[[-4.76342797e-01  2.08358914e-01  5.11056483e-01 -9.67204049e-02
  -3.82982075e-01 -7.21406844e-03  9.32689905e-02 -2.83180773e-01
   2.75494218e-01 -2.27120504e-01 -1.71401396e-01 -3.90078396e-01
  -8.76777530e-01 -1.11652412e-01 -1.34179574e-02 -1.79456607e-01
  -2.06889641e-02 -3.93742293e-01  1.02231696e-01  1.07746884e-01
   4.20903713e-01  3.29416484e-01  3.86795551e-01  1.77757367e-01
  -1.11435950e+00 -6.60615861e-01  8.67752373e-01 -7.14428365e-01
   1.35926574e-01  6.41943276e-01 -1.44992888e-01 -2.19857797e-01
   9.68627110e-02 -8.06883693e-01 -5.75907826e-01  2.01212510e-01
  -2.60848373e-01  4.70776021e-01 -7.82865763e-01 -4.56847250e-01
  -3.76819313e-01 -1.85794041e-01 -2.74007201e-01  1.16551824e-01
   1.24486730e-01 -7.58684501e-02 -1.05840497e-01 -7.11202389e-03
  -4.74651694e-01 -4.52506691e-01 -2.94009358e-01 -4.54989791e-01
   5.98688483e-01  1.09791744e+00  9.20965612e-01 -5.48521876e-01
  -9.60874185e-02  2.86660165e-01  9.01634276e-01 -9.59560454e-01
   5.92173

We can also generate embeddings for a list of sentences.

In [2]:
# Sentences we want to encode
sentences = [
    "This framework generates embeddings for each input sentence",
    "Sentences are passed as a list of string.",
    "The quick brown fox jumps over the lazy dog.",
]

# Sentences are encoded by calling model.encode()
sentence_embeddings = model.encode(sentences)

# Print the embeddings
for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding[:10], "...")
    print("Norm=", np.linalg.norm(embedding))
    print("")

Sentence: This framework generates embeddings for each input sentence
Embedding: [-0.47634262  0.20835875  0.5110564  -0.09672066 -0.38298213 -0.00721423
  0.09326907 -0.28318128  0.27549437 -0.22712018] ...
Norm= 14.269959

Sentence: Sentences are passed as a list of string.
Embedding: [ 0.24558467 -0.38048622  1.1098257  -0.44728062  0.23018073 -0.29509577
  0.9107106  -0.13001063 -0.01333597 -0.33322632] ...
Norm= 16.170506

Sentence: The quick brown fox jumps over the lazy dog.
Embedding: [ 0.43256274 -0.10374534 -0.42705497 -0.5431257   0.22969241 -0.5625215
 -0.6229514  -1.2054772   1.3298908   0.04342185] ...
Norm= 15.025292



## 2. Compare embeddings

* Cosine similarity:

In [3]:
from sentence_transformers import util

# Sentences are encoded by calling model.encode()
emb1 = model.encode("This is a red cat with a hat.")
emb2 = model.encode("Have you seen my red cat?")

cos_sim = util.cos_sim(emb1, emb2)
print("Cosine-Similarity:", cos_sim)

Cosine-Similarity: tensor([[0.5908]])
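
As a reference for what the cosine similarity above measures, here is a minimal sketch of the formula cos(a, b) = a·b / (‖a‖ ‖b‖) on two made-up 3-dimensional vectors (not real model embeddings). The helper name `cosine_similarity` is hypothetical, and only the standard library is used so that each step stays explicit.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, not real embeddings
u = [1.0, 2.0, 2.0]
v = [2.0, 4.0, 4.0]   # same direction as u, twice as long
w = [2.0, -1.0, 0.0]  # orthogonal to u

print(cosine_similarity(u, v))  # parallel vectors give 1.0
print(cosine_similarity(u, w))  # orthogonal vectors give 0.0
```

The value depends only on the angle between the vectors, not on their length, which is why `u` and `v` score 1.0 even though their norms differ.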


Another example, with more than one pair:

In [4]:
sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "Someone in a gorilla costume is playing a set of drums.",
]

# Encode all sentences
embeddings = model.encode(sentences)

# Compute cosine similarity between all pairs
cos_sim = util.cos_sim(embeddings, embeddings)

# Add all pairs to a list with their cosine similarity score
all_sentence_combinations = []
for i in range(len(cos_sim) - 1):
    for j in range(i + 1, len(cos_sim)):
        all_sentence_combinations.append([cos_sim[i][j], i, j])

# Sort list by the highest cosine similarity score
all_sentence_combinations = sorted(all_sentence_combinations, key=lambda x: x[0], reverse=True)

print("Top-5 most similar pairs:")
for score, i, j in all_sentence_combinations[0:5]:
    print("{} \t {} \t {:.4f}".format(sentences[i], sentences[j], cos_sim[i][j]))

Top-5 most similar pairs:
A man is riding a horse. 	 A man is riding a white horse on an enclosed ground. 	 0.7676
A man is eating food. 	 A man is eating a piece of bread. 	 0.6427
A monkey is playing drums. 	 Someone in a gorilla costume is playing a set of drums. 	 0.5967
A woman is playing violin. 	 A monkey is playing drums. 	 0.3307
A woman is playing violin. 	 Someone in a gorilla costume is playing a set of drums. 	 0.3277


* Resnik similarity in the ontology, for 2 phenotypes (a higher score means more similar terms):

In [5]:
from pyhpo import Ontology
import pandas as pd
# initialize the Ontology
onto = Ontology('../pubmed-queries/hpo-22-12-15-data')

PATH_DATA_FENOTIPOS = '../pubmed-queries/results/phenotypes-22-12-15.csv'
dfPhenotypes = pd.read_csv(PATH_DATA_FENOTIPOS, sep=';', low_memory=False, na_values=[''])

SEED = 42


In [6]:
# get 2 phenotypes
SEED = 42

muestraPhenotypes = dfPhenotypes.sample(2, random_state=SEED)

pairs = []
pheno1 = onto.get_hpo_object(muestraPhenotypes.iloc[0]['Id'])
pheno2 = onto.get_hpo_object(muestraPhenotypes.iloc[1]['Id'])
pairs.append([pheno1, pheno2])
pheno1 = onto.get_hpo_object('Compensatory scoliosis')
pheno2 = onto.get_hpo_object('Kyphoscoliosis')
pairs.append([pheno1, pheno2])

for pheno1, pheno2 in pairs:
    # get Resnik similarity
    sim = pheno1.similarity_score(pheno2, method='resnik')
    sim2 = pheno1.similarity_score(pheno2, method='lin')

    # get cosine similarity
    embedding1 = model.encode(pheno1.name)
    embedding2 = model.encode(pheno2.name)
    cosim = util.cos_sim(embedding1, embedding2)

    # get names of the phenotypes
    print(pheno1.name, ",", pheno2.name, ":Resnik=", sim, "Lin=", sim2,"Cosine=", cosim)


Temporomandibular joint ankylosis , Dyslexia :Resnik= 0.00036046861309514645 Lin= 4.885073886735671e-05 Cosine= tensor([[0.1194]])
Compensatory scoliosis , Kyphoscoliosis :Resnik= 2.275796718576765 Lin= 1.1389062155173746 Cosine= tensor([[0.7856]])
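
To make the two ontology scores more concrete, the sketch below applies the textbook definitions to a tiny hand-made hierarchy: Resnik similarity is the information content (IC) of the most informative common ancestor, and Lin similarity is 2·IC(ancestor) / (IC(a) + IC(b)). The term names, the parent links and the IC values are all invented for illustration, and pyhpo's internal computation may differ in its details, so the numbers are not comparable with the output above.

```python
# Hypothetical toy ontology: term -> list of parents
PARENTS = {
    "root": [],
    "skeletal": ["root"],
    "neuro": ["root"],
    "scoliosis": ["skeletal"],
    "kyphoscoliosis": ["scoliosis"],
    "compensatory_scoliosis": ["scoliosis"],
    "dyslexia": ["neuro"],
}

# Made-up information content values, IC(t) = -log p(t); the root has IC 0
IC = {
    "root": 0.0, "skeletal": 1.0, "neuro": 1.2, "scoliosis": 2.5,
    "kyphoscoliosis": 4.0, "compensatory_scoliosis": 5.0, "dyslexia": 3.0,
}

def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def resnik(a, b):
    """IC of the most informative common ancestor of a and b."""
    common = ancestors(a) & ancestors(b)
    return max(IC[t] for t in common)

def lin(a, b):
    """Resnik score rescaled by the IC of both terms."""
    denom = IC[a] + IC[b]
    return 2 * resnik(a, b) / denom if denom else 0.0

print(resnik("compensatory_scoliosis", "kyphoscoliosis"))  # IC of "scoliosis": 2.5
print(lin("compensatory_scoliosis", "kyphoscoliosis"))
print(resnik("kyphoscoliosis", "dyslexia"))  # only "root" is shared: 0.0
```

Unlike the cosine similarity of the name embeddings, both scores depend only on where the terms sit in the hierarchy, not on their wording.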


## 3. Train the model
* References: https://www.sbert.net/docs/training/overview.html https://www.sbert.net/docs/package_reference/models.html    

### Create an architecture
Sentence Transformers models already ship as an embedding module plus a pooling module, so they return a single vector (768 components for the models used here). The embedding module splits the text into tokens and returns one vector per token; the pooling module collapses those vectors into a single one. We can define our own architecture with different embedding and pooling modules. Note that in this example the embedding module has a maximum sequence length; in the final version it SHOULD NOT.

In [7]:
# Example: base + pooling architecture
from sentence_transformers import models

word_embedding_model = models.Transformer("bert-base-uncased", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

text = "I do not have my eyes closed"
sentence_embedding = model.encode(text)
print("Sentence:", text)
print("Embedding:", sentence_embedding[:10], "...")
print("Norm=", np.linalg.norm(sentence_embedding))

Sentence: I do not have my eyes closed
Embedding: [ 0.19926079  0.40846884  0.01355856 -0.34057072 -0.01478832 -0.10423365
  0.13928898  0.44513947 -0.24763162 -0.49960858] ...
Norm= 8.877091
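
The pooling step described in this section can be sketched in a few lines. The example assumes mean pooling (which, as far as I know, is the default mode of `models.Pooling`) and uses invented 3-dimensional token vectors instead of real 768-dimensional ones; the function name `mean_pooling` is hypothetical.

```python
def mean_pooling(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), ignoring padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            total = [t + x for t, x in zip(total, vector)]
            count += 1
    return [t / count for t in total]

# Toy input: 4 token positions, 3 dimensions; the last position is padding
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
    [9.0, 9.0, 9.0],  # padding, must not affect the result
]
attention_mask = [1, 1, 1, 0]

sentence_vector = mean_pooling(token_vectors, attention_mask)
print(sentence_vector)  # [2.0, 2.0, 1.0]
```

Whatever the number of tokens, the output always has the same dimension as a single token vector, which is what makes sentences of different lengths comparable.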


https://www.sbert.net/docs/package_reference/models.html has detailed information on which modules are available. For example, for an architecture with the pritamdeka BioBERT embeddings, pooling, and normalization, we would write:

In [8]:
normalization_model = models.Normalize()
# Note: this is a reference, not a copy, so append() also modifies bertmodel in place
model = bertmodel
model.append(normalization_model)

sentence_embedding = model.encode(text)
print("Sentence:", text)
print("Embedding:", sentence_embedding[:10], "...")
print("Norm=", np.linalg.norm(sentence_embedding))

Sentence: I do not have my eyes closed
Embedding: [-0.01004861  0.04328241  0.00344129  0.02131966 -0.01093566 -0.00743332
  0.01978693 -0.01606919 -0.03233068 -0.01528868] ...
Norm= 1.0


* Questions: at first I thought the model JLM was using was normalized, and that this was why we compared with cosine similarity in our initial approach. I see that it is not, so is it advisable to normalize the final embedding? What effect does normalizing after pooling have? And normalizing before it as well?  
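
One part of this question can be checked numerically: cosine similarity already divides by both norms, so normalizing the vectors first does not change it. What does change is the plain dot product, which for unit vectors becomes equal to the cosine. The sketch below shows this on two invented vectors with hypothetical helper names; it says nothing about how a normalization layer affects fine-tuning itself, which remains an open question here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit Euclidean norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

# Toy vectors, both with norm 5
a = [3.0, 4.0, 0.0]
b = [4.0, 3.0, 0.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)

print(cosine_similarity(a, b))            # cosine on the raw vectors
print(cosine_similarity(a_unit, b_unit))  # same value after normalization
print(dot(a_unit, b_unit))                # for unit vectors, dot product equals cosine
print(dot(a, b))                          # the raw dot product depends on the norms
```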
### Training data
The missing ingredient for training the model is a similarity label for each pair of sentences. Example datasets: https://www.sbert.net/examples/training/datasets/README.html

In [9]:
from sentence_transformers import InputExample, losses
import torch
from torch.utils.data import DataLoader

torch.manual_seed(SEED)

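# bertmodel already ends with the Normalize layer appended in cell 8
# (model = bertmodel is a reference, not a copy), so it is not appended again here
model = bertmodel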
train_examples = [
    InputExample(texts=["My first sentence", "My second sentence"], label=0.8),
    InputExample(texts=["Another pair", "Unrelated sentence"], label=0.3),
    InputExample(texts=["My second sentence", "My third sentence"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

text = "My third sentence"
embedding1 = model.encode(text)
train_loss = losses.CosineSimilarityLoss(model)

# Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=3)
embedding2 = model.encode(text)
print("Difference: ", np.linalg.norm(embedding2-embedding1))
# We get a small difference with respect to the original embedding.
# Only 3 optimizer steps are taken; if fit() keeps its default warmup
# (warmup_steps=10000), the learning rate is presumably still close to zero.

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

Difference:  2.8339291e-05
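
To see what is being minimized in the cell above, here is a minimal sketch of the quantity that `losses.CosineSimilarityLoss` computes in its default configuration, as I understand it: the mean squared error between the cosine similarity of each embedding pair and its label. The 2-dimensional vectors stand in for model embeddings and the helper names are hypothetical; the real loss operates on tensors and backpropagates through the model.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def cosine_similarity_loss(pairs, labels):
    """Mean squared error between cos(u, v) and the target label."""
    errors = [(cosine_similarity(u, v) - y) ** 2 for (u, v), y in zip(pairs, labels)]
    return sum(errors) / len(errors)

# Toy "embeddings" for two sentence pairs
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical direction, cosine = 1.0
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal, cosine = 0.0
]
labels = [0.8, 0.3]

# ((1.0 - 0.8)^2 + (0.0 - 0.3)^2) / 2 = 0.065 up to floating-point rounding
print(cosine_similarity_loss(pairs, labels))
```

Training moves the embeddings so that each pair's cosine similarity gets closer to its label, so the labels should live on the same scale as a cosine similarity.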


* Question: How do we get reproducible results?
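
A common starting point, sketched below, is to seed every random number generator in use before building the `DataLoader` and calling `fit`. This is an assumption-laden recipe rather than a guarantee: the helper `set_seed` is hypothetical, some GPU operations can remain non-deterministic even with these settings, and results may still vary across library versions and hardware.

```python
import random

def set_seed(seed):
    """Seed every random number generator that is installed (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored when no GPU is available
        # Ask cuDNN for deterministic kernels (can slow training down)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass

# Re-seeding reproduces the same random sequence
set_seed(42)
first = [random.random() for _ in range(3)]
set_seed(42)
second = [random.random() for _ in range(3)]
print(first == second)  # True
```

Since `shuffle=True` makes the `DataLoader` draw its ordering from a random generator, the seed has to be set again before every run that should be compared, not just once per session.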