# Learning progress: Evaluating BERT across the fine-tuning stages

We first do this with a test set:
* Abstracts from which the sample was drawn: [abstracts.csv](../../pubmed-queries/abstracts/abstracts.csv)
* Sample of phenotypes/labels: [index-phenotypes.csv](../../pubmed-queries/abstracts/index-phenotypes.csv)
* Test phenotypes: [phenotypes-22-12-15.csv](../pubmed-queries/results/phenotypes-22-12-15.csv) = leaf nodes of HPO:PhenotypicAbnormality

Steps to follow:
1. Load all the data:
  * Starting BERT model.
  * Ontology.
  * Raw training data (abstracts + phenotypes).
2. Prepare the training:
  * Processed training data (dataloaders)
  * Loss function: BatchAllTripletLoss
  * Evaluation function:
    * a) EmbeddingSimilarityEvaluator
      * Prepare phenotype pairs (train/test?)
      * Compute gold scores
    * b) Implement it ourselves (SentenceEvaluator) with the following functionality:
      * Compute MSE and correlation on train/test.
      * Write the data to a CSV file.
      * Return the test correlation.
3. Fit: run it on the server and save the results.
4. Out: show and interpret the results.
  * Plot of MSE per stage (train/test)
  * Plot of correlation per stage (train/test)
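
As an illustration of step 2b, the sketch below computes the two metrics the custom evaluator is meant to report: the MSE and the Pearson correlation between the cosine similarity of each phenotype pair and its gold score. It uses only the standard library and toy data. The function names, the toy embeddings, the gold values, and the CSV columns are all assumptions made for this example, not the notebook's actual evaluator.

```python
import csv
import io
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mse(predicted, gold):
    """Mean squared error between predicted and gold scores."""
    return sum((p - g) ** 2 for p, g in zip(predicted, gold)) / len(gold)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def evaluate_pairs(embeddings, pairs, gold_scores):
    """Return (MSE, Pearson) of embedding similarities against gold scores."""
    predicted = [cosine_similarity(embeddings[a], embeddings[b]) for a, b in pairs]
    return mse(predicted, gold_scores), pearson(predicted, gold_scores)


# Toy data (made up for illustration): three phenotypes with 2-d embeddings.
toy_embeddings = {"A": [1.0, 0.0], "B": [1.0, 1.0], "C": [0.0, 1.0]}
toy_pairs = [("A", "B"), ("A", "C"), ("B", "C")]
toy_gold = [0.7, 0.0, 0.7]  # hypothetical gold similarity scores

test_mse, test_corr = evaluate_pairs(toy_embeddings, toy_pairs, toy_gold)

# One CSV row per evaluation step; kept in memory here, a file in practice.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["epoch", "split", "mse", "pearson"])
writer.writerow([0, "test", round(test_mse, 6), round(test_corr, 6)])
print(buffer.getvalue())
```

In a real evaluator the embeddings would come from encoding the phenotypes with the model at each stage, and the same computation would be repeated for the train and test pairs so that both curves can be plotted in step 4.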

## 1. Load all the data

In [10]:
# IMPORTS
from math import nan  # float NaN (cmath is meant for complex numbers)
import sentence_transformers
import torch
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd
import nltk
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [11]:
# 1. Load all the data (raw)
from pyhpo import Ontology
import os

SRCPATH = '../../'
SEED = 42

# 1.1 Starting BERT model

PRITAMDEKAMODEL = 'pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb'
bertmodel = SentenceTransformer(PRITAMDEKAMODEL)

# 1.2 Ontology

onto = Ontology(SRCPATH+ '/pubmed-queries/hpo-22-12-15-data')

# 1.3 Raw training data

PATH_DATA = SRCPATH + '/pubmed-queries/abstracts'
PATH_DATA_CSV = PATH_DATA + '/abstracts.csv'
PATH_DATA_FENOTIPOS = SRCPATH + '/pubmed-queries/results/phenotypes-22-12-15.csv'
PATH_INDEX_FENOTIPOS = PATH_DATA + '/index-phenotypes.csv'

# abstracts
dfPapers = pd.read_csv(PATH_DATA_CSV, sep='\t', low_memory=False, na_values=['', nan])
# test phenotypes
dfPhenotypes = pd.read_csv(PATH_DATA_FENOTIPOS, sep=';', low_memory=False, na_values=['', nan])
# train phenotypes
dfIndex = pd.read_csv(PATH_INDEX_FENOTIPOS, sep='\t', low_memory=False, na_values=['', nan])

# Save everything to a more manageable directory
PATH_TRAINDATA = SRCPATH + '/traindata'

# create the directories if they do not exist
# (the loop variable is not called `dir`, which would shadow the builtin)
for dirpath in [PATH_TRAINDATA, PATH_TRAINDATA + '/abstracts',
                PATH_TRAINDATA + '/phenotypes', PATH_TRAINDATA + '/onto']:
    os.makedirs(dirpath, exist_ok=True)

dfPapers.to_csv(PATH_TRAINDATA + '/abstracts/abstracts.csv', sep='\t', index=False)
dfPhenotypes.to_csv(PATH_TRAINDATA + '/phenotypes/phenotypes.csv', sep=';', index=False)
dfIndex.to_csv(PATH_TRAINDATA + '/phenotypes/index.csv', sep='\t', index=False)

# also copy the ontology into the directory
os.system('cp -r ' + SRCPATH + '/pubmed-queries/hpo-22-12-15-data ' + PATH_TRAINDATA + '/onto')


0
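
The directory setup and the `cp -r` call in the cell above depend on a Unix shell. A portable variant can be sketched with `pathlib` and `shutil`, as below. The helper name `prepare_traindata` and the placeholder file `hp.obo` are assumptions made for this demonstration, which runs in a temporary directory rather than on the real data.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_traindata(src_onto: Path, traindata: Path) -> Path:
    """Create the traindata subdirectories and copy the ontology folder into onto/."""
    for sub in ("abstracts", "phenotypes", "onto"):
        (traindata / sub).mkdir(parents=True, exist_ok=True)
    target = traindata / "onto" / src_onto.name
    # dirs_exist_ok=True (Python 3.8+) lets the cell be re-run without errors.
    shutil.copytree(src_onto, target, dirs_exist_ok=True)
    return target


# Demonstration on a throwaway directory with a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    fake_onto = root / "hpo-22-12-15-data"
    fake_onto.mkdir()
    (fake_onto / "hp.obo").write_text("placeholder")  # hypothetical file name
    copied = prepare_traindata(fake_onto, root / "traindata")
    copied_files = sorted(p.name for p in copied.iterdir())
    print(copied_files)
```

Like `cp -r` into an existing `onto` directory, this places the ontology folder inside `traindata/onto/` under its original name.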

## 2. Prepare the training

In [12]:
# Process the train/loss/test data

# 2.1. Clean abstracts

# Look up the HPO definition of a phenotype ('""' if it has none)
def getPhenDesc(phenotypeName):
    hpoNode = onto.get_hpo_object(phenotypeName)
    description = hpoNode.definition if hpoNode.definition else '""'
    return description

# Handle NAs in the abstract column -> replace them with the title
print('Na\'s:', dfPapers['abstract'].isna().sum())
dfPapers['abstract'] = dfPapers['abstract'].fillna(dfPapers['title'])

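# clean_abstract function

# Download the stopwords from NLTK
PATH_NLTK = SRCPATH + '/traindata/nltk'
os.makedirs(PATH_NLTK, exist_ok=True)

# Note: NLTK reads NLTK_DATA when it is imported, so setting it here does not
# redirect the downloads below; they go to NLTK's default directory.
os.environ['NLTK_DATA'] = PATH_NLTK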

nltk.download('punkt')
nltk.download('stopwords')

cached_stopwords = set(stopwords.words('english'))  # set for fast membership tests
def clean_abstract(abstract):
    if isinstance(abstract, float) and np.isnan(abstract):
        return ''
    # Convert the text to lowercase
    abstract = abstract.lower()

    # Remove punctuation
    abstract = abstract.translate(str.maketrans('', '', string.punctuation))

    # Tokenize the text
    tokens = word_tokenize(abstract)

    # Remove stopwords
    tokens = [word for word in tokens if word not in cached_stopwords]

    # Join the tokens back into a single string
    abstract = ' '.join(tokens)

    return abstract

# Save clean abstracts csv

dfPapers['clean_abstract'] = dfPapers['abstract'].apply(clean_abstract)
dfPapers.drop(columns=['abstract'], inplace=True)
dfPapers.to_csv(PATH_TRAINDATA + '/abstracts/abstracts-clean.csv', sep='\t', index=False)

print('Clean abstracts in: ', PATH_TRAINDATA + '/abstracts/abstracts-clean.csv')
print('Total abstracts: ', len(dfPapers))

# 2.2 Tags = phenotypes (train)

tags = dfIndex['phenotypeName']
numlabels = len(tags)
print('Number of phenotype tags:', numlabels)
mapping = {tag: i for i, tag in enumerate(tags)}

def getLabelNumber(phenotypeName):
    return mapping[phenotypeName]

Na's: 1854


[nltk_data] Downloading package punkt to /home/domingo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/domingo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Clean abstracts in:  ../..//traindata/abstracts/abstracts-clean.csv
Total abstracts:  23226
Number of phenotype tags: 100
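
To make the effect of the cleaning steps concrete, here is a simplified, dependency-free sketch of the same pipeline (lowercase, strip punctuation, tokenize, drop stopwords). It is not the notebook's `clean_abstract`: it splits on whitespace instead of using NLTK's `word_tokenize`, and `TOY_STOPWORDS` is a tiny illustrative list rather than NLTK's English stopwords. The example sentence is made up.

```python
import string

# Tiny illustrative stopword list; the notebook uses NLTK's full English list.
TOY_STOPWORDS = {"the", "of", "in", "a", "is", "with", "and"}


def clean_abstract_simple(abstract: str) -> str:
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    text = abstract.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [word for word in text.split() if word not in TOY_STOPWORDS]
    return " ".join(tokens)


example = "Seizures in a patient with Microcephaly, and the role of SCN1A."
cleaned = clean_abstract_simple(example)
print(cleaned)  # seizures patient microcephaly role scn1a
```

One consequence worth keeping in mind: because `string.punctuation` includes the hyphen, a term such as "X-linked" becomes "xlinked" after this step.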


### 2.3 Profiling
We take only a sample of the abstracts (10% in the run shown here) so that the tests run faster. Once the tests work, this is switched to the full data for the official version of the experiment.

In [13]:
from torch.utils.data import DataLoader, Dataset
from sentence_transformers import SentenceTransformer, SentencesDataset, losses, evaluation, InputExample

PROFILING = True
# Fraction of abstracts to use; 0.1 matches the run shown in the output below
SAMPLEPERCENT = 0.1 if PROFILING else 1.0
SEED = 42
torch.manual_seed(SEED)

train = dfPapers.sample(frac=SAMPLEPERCENT, random_state=SEED)
numexamples = len(train)
print('Number of train examples:', numexamples)



Number of train examples: 2323
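
The role of the seed and the resulting sample size can be illustrated without pandas. The sketch below draws a fixed-seed random subset with the standard library. `sample_fraction` is a hypothetical helper, and the rows it picks will differ from those chosen by `DataFrame.sample`; only the size and the reproducibility are comparable.

```python
import random


def sample_fraction(items, frac, seed):
    """Return a reproducible random subset containing round(frac * len(items)) items."""
    k = round(frac * len(items))
    return random.Random(seed).sample(items, k)


abstract_ids = list(range(23226))  # stand-in for the 23226 abstract rows
first = sample_fraction(abstract_ids, 0.1, seed=42)
second = sample_fraction(abstract_ids, 0.1, seed=42)
print(len(first), first == second)  # 2323 True
```

A 10% sample of 23226 rows rounds to 2323, which matches the number of train examples reported above, and repeating the call with the same seed returns the same subset.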
