# Part II. Fine-Tuning the BERT Model

**Team Members - BSAC Covid 19 Group (Saturdays AI UIO)**
<table align="left">
  <tr>
    <td>Sandra Torres</td>
    <td>Wendy Jara</td>    
  </tr>
  <tr>
    <td>Edwin Rodriguez</td>
    <td>Christian Pichucho</td>
  </tr>
  <tr>
    <td>Jorge Vargas</td>
    <td>Milton Fonseca</td>
  </tr>
  <tr>
    <td>Sebastián Ayala</td>
    <td><i>*Coordinator</i></td>
  </tr> 
</table>  

The goal of this work is to fine-tune the pre-trained model **BioBERT-Base v1.1 (+ PubMed 1M)**, available at https://github.com/dmis-lab/biobert, on the dataset provided by Kaggle in its <a href="https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge">COVID-19 Open Research Dataset Challenge (CORD-19)</a>.

The fine-tuning targets a <a href="https://en.wikipedia.org/wiki/Multi-label_classification">multi-label classification</a> task.

The training data comes from the *unsupervised topic classification* performed on that same dataset in the notebook **BSAC_Covid_19-Parte_1-Interactive_Abstract_and_Expert_Finder.ipynb**, which in turn builds on <a href="https://www.kaggle.com/jdparsons/biobert-corex-topic-search">Interactive Search using BioBERT and CorEx Topic Modeling</a> by John David Parsons.

The fine-tuning procedure follows the article <a href="https://towardsdatascience.com/building-a-multi-label-text-classifier-using-bert-and-tensorflow-f188e0ecdc5d">Building a Multi-label Text Classifier using BERT and TensorFlow</a> by Javaid Nabi, whose code is available <a href="https://github.com/javaidnabi31/Multi-Label-Text-classification-Using-BERT/blob/master/multi-label-classification-bert.ipynb">here</a>.

Additionally, the following sources were consulted:

|Title|Author|Resource|
|-|-|-|
|Multi-label Text Classification using BERT – The Mighty Transformer|Kaushal Trivedi|<a href="https://medium.com/huggingface/multi-label-text-classification-using-bert-the-mighty-transformer-69714fa3fb3d">blog</a> <a href="https://nbviewer.jupyter.org/github/kaushaltrivedi/bert-toxic-comments-multilabel/blob/master/toxic-bert-multilabel-classification.ipynb">code</a>|
|run_classifier.py|Google AI Team|<a href="https://github.com/google-research/bert/blob/master/run_classifier.py">code</a>|
|run_multilabels_classifier.py|Google AI Team, huangyajian|<a href="https://github.com/yajian/bert/blob/master/run_multilabels_classifier.py">code</a>|
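
To make the multi-label setting concrete, the sketch below contrasts a multi-hot target (any number of labels may be active for one document) with per-label sigmoid scores that are thresholded independently. The label names, the logits, and the 0.5 threshold are illustrative assumptions, not values taken from this notebook.

```python
import numpy as np

# Hypothetical label vocabulary (illustrative only).
labels = ["binding", "patient", "vaccine", "health"]


def to_multi_hot(active, vocabulary):
    """Return a 0/1 vector with a 1 for every label present in `active`."""
    return np.array([1 if name in active else 0 for name in vocabulary])


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


# A single document may carry several labels at once.
target = to_multi_hot({"patient", "vaccine"}, labels)

# Made-up logits, one per label; each label is scored independently,
# unlike softmax, which would force the scores to compete.
logits = np.array([-2.0, 3.0, 0.5, -1.0])
probs = sigmoid(logits)
predicted = (probs >= 0.5).astype(int)  # assumed decision threshold

print(target)
print(predicted)
```

In a single-label (multi-class) task exactly one position of the target would be 1; here two positions are active for the same document.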

# 1. Setup and Data Loading

## 1.1 Parameters

In [None]:
# Names of the files to import
file_corex_topics = 'corex_topic_model.pkl' # Produced by the 'BioBERT + Corex Topic Search' notebook
file_df_topics = 'df_final_covid_clean_topics.pkl' # Produced by the 'BioBERT + Corex Topic Search' notebook

# Directory paths
path_folder_root = '/content/drive/My Drive/'
# NOTE: Change the following names to directories that exist in your Google Drive
path_folder_project_clustering = path_folder_root + 'NLP/projects/interactive-abstract-and-expert-finder/'
path_folder_project_bert = path_folder_root + 'NLP/projects/bert/'
path_folder_input = 'input/'
path_folder_output = 'output/'
path_models = path_folder_root + 'BERT/models/'
path_folder_fine_tuning = path_models + 'fine-tuned/'
path_folder_model_tuned = path_folder_fine_tuning + 'model/'
path_folder_train_output = path_folder_fine_tuning + 'train/'
path_folder_eval_output = path_folder_fine_tuning + 'eval/'

# File paths
path_file_corex_topics = path_folder_project_clustering + path_folder_output + file_corex_topics
path_file_df_topics =  path_folder_project_clustering + path_folder_output + file_df_topics


# BERT parameters
# Download the model from: https://github.com/dmis-lab/biobert
# This notebook works with the following BioBERT version:
# BioBERT-Base v1.1 (+ PubMed 1M) - based on BERT-base-Cased (same vocabulary)
bert_model = 'biobert_v1.1_pubmed'

dict_bert_params = {}
dict_bert_params['model_dir'] = bert_model
# Contains the model vocabulary (word-to-index mapping)
dict_bert_params['vocab'] = 'vocab.txt'
# Contains the BERT model architecture
dict_bert_params['config'] = 'bert_config.json'
# Contains the weights of the pre-trained model
dict_bert_params['init_chkpnt'] = 'model.ckpt-1000000'

# Prefix every entry with the path of the model directory
for k, v in dict_bert_params.items():
  dict_bert_params[k] = path_models + bert_model + '/' + v


# Flags
flag_use_google_drive = True

# General training parameters
ID = 'cord_uid'
DATA_COLUMN = 'document'
LABEL_COLUMNS = [] # set to the list 'list_topic_words' in section 2.4 (document-topic-words matrix)
NUM_OF_FEATURES = 0 # Number of features or labels (LABEL_COLUMNS), updated in section 2.4
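
The path handling in the cell above can also be sketched with `os.path.join`. One detail worth noting: the loop in that cell prefixes every value in `dict_bert_params`, so the `'model_dir'` entry ends with the model name twice (`.../biobert_v1.1_pubmed/biobert_v1.1_pubmed`). The helper `build_bert_paths` below is a hypothetical name introduced only for illustration; it keeps `model_dir` pointing at the model directory itself. Whether later cells rely on the original value is not visible here, so this is a sketch under that assumption rather than a drop-in replacement.

```python
import os


def build_bert_paths(models_root, model_name):
    """Return a dict of BERT artifact paths located under models_root/model_name."""
    model_dir = os.path.join(models_root, model_name)
    return {
        "model_dir": model_dir,
        # Vocabulary file (word-to-index mapping)
        "vocab": os.path.join(model_dir, "vocab.txt"),
        # Model architecture description
        "config": os.path.join(model_dir, "bert_config.json"),
        # Checkpoint prefix of the pre-trained weights
        "init_chkpnt": os.path.join(model_dir, "model.ckpt-1000000"),
    }


paths = build_bert_paths("/content/drive/My Drive/BERT/models", "biobert_v1.1_pubmed")
for key, value in paths.items():
    print(key, "->", value)
```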

In [None]:
path_folder_model_output = path_models + 'fine-tuned/' # train output
path_folder_eval_output = path_folder_model_output + 'eval/' # model evaluation output


In [None]:
# Build a dictionary of the parameters defined above and print them.
# Only string or boolean globals whose name contains one of the markers below are kept.
# https://stackoverflow.com/questions/6531482/how-to-check-if-a-string-contains-an-element-from-a-list-in-python
paramToCheck = ['file', 'folder', 'bert', 'flag']

dict_params = dict(
    filter(lambda item: any(paramType in item[0] for paramType in paramToCheck)
                  and isinstance(item[1], (str, bool)),
           globals().items())
    )

print(len(dict_params))

for param in sorted(dict_params):
  print( str(param + ':\t').rjust(30) + str(dict_params[param]))


16
                  bert_model:	biobert_v1.1_pubmed
           file_corex_topics:	corex_topic_model.pkl
              file_df_topics:	df_final_covid_clean_topics.pkl
       flag_use_google_drive:	True
      path_file_corex_topics:	/content/drive/My Drive/NLP/projects/interactive-abstract-and-expert-finder/output/corex_topic_model.pkl
         path_file_df_topics:	/content/drive/My Drive/NLP/projects/interactive-abstract-and-expert-finder/output/df_final_covid_clean_topics.pkl
     path_folder_eval_output:	/content/drive/My Drive/BERT/models/fine-tuned/eval/
     path_folder_fine_tuning:	/content/drive/My Drive/BERT/models/fine-tuned/
           path_folder_input:	input/
    path_folder_model_output:	/content/drive/My Drive/BERT/models/fine-tuned/
     path_folder_model_tuned:	/content/drive/My Drive/BERT/models/fine-tuned/model/
          path_folder_output:	output/
    path_folder_project_bert:	/content/drive/My Drive/NLP/projects/bert/
path_folder_project_clustering:	/content/drive/

## 1.2 Install Dependencies

### 1.2.1 CorEx

In [None]:
# CorEx topic modeling dependencies
# https://github.com/gregversteeg/corex_topic
!pip install 'corextopic'

Collecting corextopic
  Downloading https://files.pythonhosted.org/packages/64/1d/b2a320090c91e67dd93a7e6715794a126b1a56c643824444c952b303dc0a/corextopic-1.0.6-py3-none-any.whl
Installing collected packages: corextopic
Successfully installed corextopic-1.0.6


### 1.2.2 BERT

In [None]:
# Install bert-tensorflow if it is not already installed
!pip install bert-tensorflow

Collecting bert-tensorflow
  Downloading https://files.pythonhosted.org/packages/a6/66/7eb4e8b6ea35b7cc54c322c816f976167a43019750279a8473d355800a93/bert_tensorflow-1.0.1-py2.py3-none-any.whl (67kB)
Installing collected packages: bert-tensorflow
Successfully installed bert-tensorflow-1.0.1


## 1.3 Import Libraries

In [None]:
# 1) Downgrade TensorFlow in Google Colab from 2.x to 1.x
# (for compatibility with some BERT libraries and with bert-as-service)

# NOTE: At the time of writing, `import tensorflow` in Google Colab imports
# version 2.x by default.
# Version 1.x can be selected by running a cell with the `%tensorflow_version`
# magic command before running `import tensorflow`.

# TensorFlow versions in Colab
# https://colab.research.google.com/notebooks/tensorflow_version.ipynb#scrollTo=8UvRkm1JGUrk

# How to downgrade tensorflow version in colab?
# https://stackoverflow.com/questions/51888118/how-to-downgrade-tensorflow-version-in-colab/54445624
%tensorflow_version 1.x

TensorFlow 1.x selected.


In [None]:
%%time

# 2) Import libraries
import bert
import collections
import numpy as np
import os
import pandas as pd
import pickle
import tensorflow as tf
import tensorflow_hub as hub

# Corex Imports
from corextopic import corextopic as ct

# BERT Imports
from bert import run_classifier
from bert import optimization
from bert import tokenization
from bert import modeling

# Other general Imports
from datetime import datetime


CPU times: user 1.14 s, sys: 218 ms, total: 1.36 s
Wall time: 4.97 s


## 1.4 Mount Google Drive

In [None]:
# Mount Google Drive to this Notebook instance.
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


## 1.5 Import the Data Produced by the Previous Processing Step

### 1.5.1 Load the DataFrame of Article Abstracts and Topics

In [None]:
# 1) Load the DataFrame with the article abstracts and their classified topics
# File: df_final_covid_clean_topics.pkl
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_pickle.html
df_covid = pd.read_pickle(path_file_df_topics)

# 2) (Disabled) Set the 'cord_uid' column as the index
#df_covid.set_index('cord_uid', inplace=True)

# 3) Drop the columns that are not needed for the processing below
df_covid.drop(columns=
['index',
 'sha',
 'source_x',
 'title',
 'doi',
 'pmcid',
 'pubmed_id',
 'license',
 'abstract',
 'publish_time',
 'authors',
 'journal',
 'Microsoft Academic Paper ID',
 'WHO #Covidence',
 'arxiv_id',
 'has_pdf_parse',
 'has_pmc_xml_parse',
 'full_text_file',
 'url'], inplace=True
)

In [None]:
print("df_covid datatype:", type(df_covid))
print("df_covid shape:", df_covid.shape)
df_covid.info()

df_covid datatype: <class 'pandas.core.frame.DataFrame'>
df_covid shape: (48510, 24)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48510 entries, 0 to 48509
Data columns (total 24 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   cord_uid     48510 non-null  object 
 1   document     48510 non-null  object 
 2   clean        48510 non-null  object 
 3   clean_tfidf  48510 non-null  object 
 4   topic_0      48510 non-null  float64
 5   topic_1      48510 non-null  float64
 6   topic_2      48510 non-null  float64
 7   topic_3      48510 non-null  float64
 8   topic_4      48510 non-null  float64
 9   topic_5      48510 non-null  float64
 10  topic_6      48510 non-null  float64
 11  topic_7      48510 non-null  float64
 12  topic_8      48510 non-null  float64
 13  topic_9      48510 non-null  float64
 14  topic_10     48510 non-null  float64
 15  topic_11     48510 non-null  float64
 16  topic_12     48510 non-null  float64
 17  top

In [None]:
df_covid.head()

Unnamed: 0,cord_uid,document,clean,clean_tfidf,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,topic_10,topic_11,topic_12,topic_13,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19
0,zjufx4fo,Sequence requirements for RNA strand transfer ...,"[[sequence, requirement, rna, strand, transfer...","[sequence, requirement, rna, strand, transfer,...",-11.967215,-14.36535,-4.232822e-05,-6.03878,0.0,-9.256285,-8.990463,-10.88836,-8.524362,-4.91486,-17.746458,-10.139457,-8.976088,-8.913227,-8.710551,-5.55459,-5.697667,-8.9e-05,-5.325501,-1.48462
1,ymceytj3,"Crystal structure of murine sCEACAM1a[1,4]: a ...","[[crystal_structure, murine, sceacama, coronav...","[crystal_structure, murine, coronavirus, recep...",-13.910713,-14.805321,-3.573142e-12,-7.631686,-8.760851,-9.256256,-10.317028,-10.889253,-14.412434,-1e-05,-25.626166,-5.738624,-8.000439,-8.900729,-8.678459,-2.377468,-6.530519,-1.2e-05,-9.027035,-5.345502
2,wzj2glte,Synthesis of a novel hepatitis C virus protein...,"[[synthesis, novel, hepatitis, virus, protein,...","[synthesis, novel, hepatitis, ribosomal_frames...",-13.910713,-12.204447,-1.0595e-07,-10.775814,0.0,-9.256337,-8.045243,-10.887869,-6.730414,-4.399364,-25.626187,-4.930299,-8.975826,-5.811124,-8.505479,-0.304766,-5.277167,-0.229326,-5.505994,-0.820358
3,2sfqsfm1,Structure of coronavirus main proteinase revea...,"[[structure, coronavirus, main, proteinase, re...","[structure, coronavirus, main, proteinase, rev...",-13.911672,-13.899673,0.0,-6.329999,-0.10705,-5.852967,-4.255949,-10.853199,-4.850234,-8.392439,-25.625044,-10.143355,-5.797064,-2.43338,-7.356429,-5.216935,-6.513124,-8.4e-05,-5.068582,-4.267154
4,i0zym7iq,Discontinuous and non-discontinuous subgenomic...,"[[discontinuous, nondiscontinuous, subgenomic_...","[discontinuous, subgenomic_rna, transcription,...",-10.473175,-13.169311,-0.07774288,-11.078761,0.0,-9.256224,-4.648812,-10.888558,-6.800841,-5.802712,-25.626739,-10.143009,-8.356926,-8.898013,-8.326373,-3.828599,-6.539708,-0.46064,-1.997524,-0.005377


### 1.5.2 Load the CorEx Topics

In [None]:
# Load the CorEx topic model from: corex_topic_model.pkl
# It holds the set of topics learned with CorEx
corex_topics = pd.read_pickle(path_file_corex_topics)

print("corex_topics datatype:", type(corex_topics))
print("len(corex_topics.words):", len(corex_topics.words))
print("len(corex_topics.get_topics()):", len(corex_topics.get_topics()))

corex_topics datatype: <class 'corextopic.corextopic.Corex'>
len(corex_topics.words): 9089
len(corex_topics.get_topics()): 20


# 2. Data Processing

## 2.1 Extract Information from the DataFrames

### 2.1.1 Extract the Frequent Terms of Each Document

In [None]:
# 1) Load the terms used in the documents, ordered by frequency
doc_terms = df_covid['clean_tfidf']
doc_terms.head()

0    [sequence, requirement, rna, strand, transfer,...
1    [crystal_structure, murine, coronavirus, recep...
2    [synthesis, novel, hepatitis, ribosomal_frames...
3    [structure, coronavirus, main, proteinase, rev...
4    [discontinuous, subgenomic_rna, transcription,...
Name: clean_tfidf, dtype: object

In [None]:
# 2) Convert the column of document terms to a list
doc_terms = doc_terms.tolist()
print("doc_terms type:", type(doc_terms))
print("len(doc_terms):", len(doc_terms))

doc_terms type: <class 'list'>
len(doc_terms): 48510


### 2.1.2 Extract the CorEx Topics

In [None]:
# 1) Get the topics
#help(corex_topics.get_topics)
topics = corex_topics.get_topics()  # Default: n_words=10
print("type(topics):", type(topics))
print("len(topics):", len(topics))

type(topics): <class 'list'>
len(topics): 20


In [None]:
# 2) Build a list and a dictionary of topics
topic_list = []  # 2.1.a) human-readable strings, one per topic
topic_dict = {}  # 2.1.b) topic id -> list of (word, weight) tuples

for n, topic in enumerate(topics):
    topic_id = 'topic_' + str(n)
    topic_list.append(topic_id + ': ' + str(topic))
    topic_dict[topic_id] = topic


In [None]:
# 2.2.a) Check the topic list
print(len(topic_list))
topic_list

20


["topic_0: [('health', 0.07819868863594989), ('public_health', 0.047379193897883566), ('national', 0.04701427268345163), ('risk', 0.046775351041916026), ('policy', 0.0439273024375846), ('international', 0.04352759900681843), ('care', 0.04023434762464224), ('practice', 0.036697040974408186), ('medical', 0.036401359257311756), ('measure', 0.035782878395272784)]",
 "topic_1: [('patient', 0.15485459288984565), ('respiratory', 0.12153850117832599), ('child', 0.07954655572121214), ('clinical', 0.07797813266884773), ('acute', 0.07133336096940021), ('pneumonia', 0.061411567951009195), ('symptom', 0.06026015650904005), ('respiratory_tract', 0.05267492814517904), ('respiratory_syncytial', 0.04918432880240683), ('rhinovirus', 0.04347807305923954)]",
 "topic_2: [('binding', 0.08487981563257542), ('activity', 0.07399665259932148), ('inhibit', 0.06402164406873423), ('inhibitor', 0.060225820339119746), ('membrane', 0.05039574144437599), ('domain', 0.050373571524239394), ('inhibition', 0.0466543092971

In [None]:
# 2.2.b) Check the topic dictionary
print(len(topic_dict))
#topic_dict

20


## 2.2 Determine the Most Representative Topic of Each Document

In [None]:
# 1) Among the topic columns (topic_*), get the name of the column
# that holds the highest score
#https://thispointer.com/pandas-find-maximum-values-position-in-columns-or-rows-of-a-dataframe/
corex_cols = [col for col in df_covid if col.startswith('topic_')]
print(corex_cols)

# Determine the most representative topic for each document
df_covid['best_topic'] = df_covid[corex_cols].idxmax(axis=1)
df_covid['best_topic'].describe()

['topic_0', 'topic_1', 'topic_2', 'topic_3', 'topic_4', 'topic_5', 'topic_6', 'topic_7', 'topic_8', 'topic_9', 'topic_10', 'topic_11', 'topic_12', 'topic_13', 'topic_14', 'topic_15', 'topic_16', 'topic_17', 'topic_18', 'topic_19']


count       48510
unique         20
top       topic_2
freq         5894
Name: best_topic, dtype: object

In [None]:
# 2) Among the topic columns (topic_*), get the highest value (score).
# NOTE: This is the value of the column whose name is stored
# in 'best_topic'.
df_covid['best_topic_score'] = df_covid[corex_cols].max(axis=1)
df_covid['best_topic_score'].describe()

count    4.851000e+04
mean    -5.363184e-02
std      4.337930e-01
min     -5.411805e+00
25%     -2.085213e-08
50%     -6.217249e-14
75%      0.000000e+00
max      0.000000e+00
Name: best_topic_score, dtype: float64
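
The two steps above, `idxmax(axis=1)` for the column name and `max(axis=1)` for the value, can be checked on a toy frame. The scores below are made up for illustration; only the column naming follows the notebook.

```python
import pandas as pd

# Toy scores (made up); rows are documents, columns are topics.
scores = pd.DataFrame(
    {
        "topic_0": [-11.9, -0.2, -7.5],
        "topic_1": [0.0, -3.1, -0.001],
        "topic_2": [-4.2, -9.8, -6.0],
    }
)

topic_cols = [c for c in scores.columns if c.startswith("topic_")]

# Column label of the row-wise maximum.
best_topic = scores[topic_cols].idxmax(axis=1)
# The row-wise maximum itself.
best_score = scores[topic_cols].max(axis=1)

print(best_topic.tolist())
print(best_score.tolist())
```

When several topics tie for the maximum, `idxmax` returns the first matching column. That matters for this dataset because many documents have more than one topic with a score of exactly 0.0.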

In [None]:
# 3) (Disabled) Drop rows with 'best_topic_score' <= 0
#df.drop(df[condition].index, axis=0, inplace=True)
#df_covid.drop(df_covid[df_covid.best_topic_score <= 0].index, axis=0, inplace=True)
#df_covid.shape

## 2.3 Determine the N Most Representative Words of Each Topic

In [None]:
# 1) Choose the number N of words to use
# NOTE:
# According to its documentation, get_topics() returns by default
# 10 words per topic, ordered by their relevance within the topic.
#   help(corex_topics.get_topics) # Default: n_words=10
# Of these 10 words we arbitrarily keep the first N, keeping N small because
# more words mean more labels, which would slow down the later training
# (fine-tuning) of the BERT model.
top_n_words = 3

# TODO: find a suitable methodology so that this number is not
# chosen arbitrarily

In [None]:
# 2) Create the columns for the top-N words

# 2.a) Build the column names
word_cols = ["word_" + str(i) for i in range(top_n_words)]

print(word_cols)

# 2.b) Create the columns in the DataFrame, initially empty
for col in word_cols:
  df_covid[col] = ""

df_covid[word_cols].describe()

['word_0', 'word_1', 'word_2']


Unnamed: 0,word_0,word_1,word_2
count,48510.0,48510.0,48510.0
unique,1.0,1.0,1.0
top,,,
freq,48510.0,48510.0,48510.0


In [None]:
# 3) Fill the new columns with the top-N words of each document's best topic
#https://thispointer.com/pandas-apply-a-function-to-single-or-selected-columns-or-rows-in-dataframe/
#https://datatofish.com/if-condition-in-pandas-dataframe/
def get_topic_word(topic_dictionary, topic_id, word_index):
  # Given a topic dictionary 'topic_dictionary', return the word at
  # position 'word_index' of the topic 'topic_id'
  word, weight = topic_dictionary[topic_id][word_index]
  return word

for i, col in enumerate(word_cols):
  # 'col' is the current column name (word_{i})
  df_covid[col] = df_covid['best_topic'].apply(lambda x: get_topic_word(topic_dict, x, i))

df_covid[word_cols].describe()

Unnamed: 0,word_0,word_1,word_2
count,48510,48510,48510
unique,20,20,20
top,binding,activity,inhibit
freq,5894,5894,5894


## 2.4 Build the 'Document-Topic-Words' Matrix

### 2.4.1 Build the List of Relevant Words

In [None]:
# 1) Build the list of all relevant words across all documents
set_topic_words = set()

for col in word_cols:
  # Add the distinct values of the current column (word_{i}) to the set of words
  set_topic_words = set_topic_words.union(df_covid[col].to_list())

list_topic_words = sorted(set_topic_words)
print( len(list_topic_words) )

# Use the computed word list as the list of labels
LABEL_COLUMNS = list_topic_words 
# Use the number of words found as the number of features
NUM_OF_FEATURES = len(list_topic_words)

print("NUM_OF_FEATURES:", NUM_OF_FEATURES)

60
NUM_OF_FEATURES: 60


### 2.4.2 Build the Matrix

In [None]:
%%time
# 1) Add one column per label word and set the cell value to:
# 1 when the word is one of the document's top-N topic words (word_* columns)
# 0 otherwise
# NOTE: This routine can take a few minutes to run!
for word in list_topic_words:
  # https://thispointer.com/pandas-apply-a-function-to-single-or-selected-columns-or-rows-in-dataframe/
  df_covid[word] = \
    df_covid[word_cols].apply(lambda row: 1 if any(word == row[col] for col in word_cols) else 0, axis='columns')

CPU times: user 1min 25s, sys: 35 ms, total: 1min 25s
Wall time: 1min 26s
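
The row-wise `apply` above calls a Python function once per row and per label. A vectorized alternative is sketched below on a toy frame; it is not the notebook's method, and the `toy*` names and values are illustrative assumptions.

```python
import pandas as pd

# Toy frame with the same layout as the word_* columns (values are illustrative).
toy = pd.DataFrame(
    {
        "word_0": ["binding", "patient", "binding"],
        "word_1": ["activity", "respiratory", "activity"],
        "word_2": ["inhibit", "child", "inhibit"],
    }
)
toy_word_cols = ["word_0", "word_1", "word_2"]
toy_labels = sorted(set(toy[toy_word_cols].to_numpy().ravel()))

# One whole-frame comparison per label instead of one Python call per row:
# eq() marks matching cells, any(axis=1) collapses them per document.
for label in toy_labels:
    toy[label] = toy[toy_word_cols].eq(label).any(axis=1).astype(int)

# Each document should end up with exactly len(toy_word_cols) active labels.
print(toy[toy_labels].sum(axis=1).tolist())
```

The resulting 0/1 columns should match those produced by the `apply` version, since both test whether the label equals any of the document's `word_*` values.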


In [None]:
# 2) Count the records whose best-topic score is non-zero
# (the maximum of 'best_topic_score' is 0, so these scores are strictly negative)
len(df_covid[df_covid.best_topic_score != 0])

27874

In [None]:
# 3) Count the label columns set to 1 for each document
# NOTE: the count should equal the 'top_n_words' parameter
#https://stackoverflow.com/questions/25748683/pandas-sum-dataframe-rows-for-given-columns
df_covid['count'] = df_covid.loc[:,list_topic_words].sum(axis=1)

In [None]:
# 4) Show information about the updated DataFrame
df_covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48510 entries, 0 to 48509
Data columns (total 90 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   cord_uid                  48510 non-null  object 
 1   document                  48510 non-null  object 
 2   clean                     48510 non-null  object 
 3   clean_tfidf               48510 non-null  object 
 4   topic_0                   48510 non-null  float64
 5   topic_1                   48510 non-null  float64
 6   topic_2                   48510 non-null  float64
 7   topic_3                   48510 non-null  float64
 8   topic_4                   48510 non-null  float64
 9   topic_5                   48510 non-null  float64
 10  topic_6                   48510 non-null  float64
 11  topic_7                   48510 non-null  float64
 12  topic_8                   48510 non-null  float64
 13  topic_9                   48510 non-null  float64
 14  topic_

In [None]:
# 5) Display a random sample of 5 records from 'df_covid'
df_covid.sample(n=5)

Unnamed: 0,cord_uid,document,clean,clean_tfidf,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,topic_10,topic_11,topic_12,topic_13,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19,best_topic,best_topic_score,word_0,word_1,word_2,activation,activity,age,antibody,antigen,approach,background,base,binding,calf,cell_line,...,diarrhea,effect,express,expression,gene,genome,global,health,increase,inhibit,interaction,le,mechanism,model,mouse,national,need,objective,patient,pig,population,public_health,que,replication,research,respiratory,review,role,sample,sarscov,sequence,severe_acute_respiratory,significantly,structure,surveillance,total,understand,vaccine,vitro,count
19425,y0s0yydl,Mutation of Host Δ9 Fatty Acid Desaturase Inhi...,"[[mutation, host, fatty_acid, desaturase, inhi...","[mutation, host, fatty_acid, inhibits, brome_m...",-12.795788,-12.03499,0.0,-2.512689,0.0,-9.256283,-10.317463,-5.819684,-8.509904,-5.63586,-25.626276,-10.14165,-5.752181,-4.667865,-7.916734,-6.725431,-5.676054,-2e-06,-0.078273,-0.034047,topic_2,0.0,binding,activity,inhibit,0,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3
10334,mkegxh7v,Adaptive memory and evolution of the human nat...,"[[adaptive, memory, evolution, human, naturali...","[adaptive, memory, evolution, mind, insight, m...",-5.937405,-6.827415,-8.997992,-13.069386,-5.844794e-07,-8.283356,-4.392781,-10.888228,-8.524486,-2.138001,-25.626353,-10.016427,-2.975826,-1e-06,-2.559081,-0.000295,-4.545586,-3.734556,-0.146203,-5.366926,topic_4,-5.844794e-07,sequence,genome,gene,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,3
4337,b0tlco4t,Etiology of Severe Childhood Pneumonia in The ...,"[[etiology, severe, childhood, pneumonia, gamb...","[etiology, severe, childhood, pneumonia, west_...",-13.911759,-8.401102e-11,-10.12527,-12.22849,-11.16279,-8.057314,-10.320849,-0.903339,-10.353499,-8.389869,-25.625982,-5.839084,-8.9692,-6.121042,-6.111404,-6.848467,-4.358565,-4.58081,-6.391552,-3.628143,topic_1,-8.401102e-11,patient,respiratory,child,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,3
14099,fduv4yk2,Ibis T5000: a universal biosensor approach for...,"[[ibis, universal, biosensor, approach, microb...","[universal, biosensor, approach, microbiology,...",-13.910713,-12.86583,-7.771083,-11.81698,-7.409468,-7.670176,-0.005939,-0.001621,-10.352767,-8.392398,-25.626002,-10.143011,-8.975376,-5.200953,-7.938087,-0.442373,-6.543736,-3.936592,-5.89325,-5.379325,topic_7,-0.001621377,detection,sample,detect,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,3
10930,ebx603bn,Peptide-conjugate antisense based splice-corre...,"[[peptideconjugate, antisense, base, splicecor...","[antisense, base, muscular, dystrophy, muscula...",-13.910659,-7.406791,-2.583267e-12,-8.20238,-5.233654,-3.751229,-4.724675,-7.177901,-7.956844,-1.543536,-25.625987,-10.142098,-6.296149,-0.006447,-6.713429,-0.001542,-4.960554,-1e-06,-1.993565,-2.957728,topic_2,-2.583267e-12,binding,activity,inhibit,0,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3


# 3. Fine-Tuning the BERT Model

## 3.1 Preparing Programs and Configuration

### 3.1.1 Define Utility Classes and Functions for Training

Next, the data must be converted into the format BERT expects. The utility classes and functions below handle that conversion.

#### 3.1.1.1 InputExample

Class representing a single training/test example for simple sequence classification.

In [None]:
class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, labels=None):
        """Constructs a InputExample.

        Args:
            guid: Unique id for the example.
            text_a: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
            text_b: (Optional) string. The untokenized text of the second sequence.
            Only must be specified for sequence pair tasks.
            labels: (Optional) [string]. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.labels = labels

#### 3.1.1.2 InputFeatures

Class representing a single set of data features.

In [None]:
class InputFeatures(object):
    """A single set of features of data."""

    def __init__(self, input_ids, input_mask, segment_ids, label_ids, is_real_example=True):
        self.input_ids = input_ids
        self.input_mask = input_mask
        self.segment_ids = segment_ids
        self.label_ids = label_ids  # no trailing comma: a comma here would wrap the labels in a 1-tuple
        self.is_real_example = is_real_example

#### 3.1.1.3 create_examples()

`create_examples()` reads the DataFrame and loads each input text and its corresponding target labels (classes) into an `InputExample` object.

In [None]:
def create_examples(df, idx_column_index, idx_column_text, idx_columns_labels):
    """Creates examples for the training and dev sets."""
    examples = []
    for (i, row) in enumerate(df.values):
        #print("row:\n",row)
        #print("index_column:", idx_column_index)
        #print("row[idx_column_index]:", row[idx_column_index])
        #print("row[0]:", row[0])
        guid = row[idx_column_index]
        text_a = row[idx_column_text]
        labels = row[idx_columns_labels]
        # Append an 'InputExample' instance to the list of examples
        examples.append(
            InputExample(guid=guid, text_a=text_a, labels=labels))
    return examples

#### 3.1.1.4 convert_examples_to_features()

Using `tokenizer`, we call **`convert_examples_to_features`** on our examples to convert them into features that BERT understands.

This method adds the special tokens **"CLS"** and **"SEP"** that BERT uses to mark the *start* and the *end* of a sentence. It also adds the **"index"** and **"segment"** tokens to each input, and pads every sequence to `max_seq_length`. This function therefore does all the input formatting that BERT expects.

<img src="https://miro.medium.com/max/1400/1*IA45-w25Ach4LcgsBABYKA.png" alt="Representación de las entradas de BERT" width="60%" height="60%" />  
*BERT input representation. The input embeddings are the sum of the token, segment and position embeddings.*
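
As a rough illustration of the formatting described above, the following sketch builds the token list, ids, mask and segment ids for a single sentence without TensorFlow. The whitespace split and the tiny `toy_vocab` are assumptions made only for this example; the notebook itself uses `tokenization.FullTokenizer` with the BioBERT vocabulary.

```python
def format_single_sequence(words, vocab, max_seq_length):
    """Toy version of BERT input formatting for a single sequence."""
    # Reserve two positions for [CLS] and [SEP]
    words = words[:max_seq_length - 2]
    tokens = ["[CLS]"] + words + ["[SEP]"]
    # A single sequence uses segment id 0 for every token
    segment_ids = [0] * len(tokens)
    # Map each token to its vocabulary id (unknown tokens map to [UNK])
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    # 1 marks a real token, 0 marks padding
    input_mask = [1] * len(input_ids)
    padding = [0] * (max_seq_length - len(input_ids))
    return tokens, input_ids + padding, input_mask + padding, segment_ids + padding


# Hypothetical vocabulary, for illustration only
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "the": 4, "dog": 5, "is": 6, "hairy": 7}

tokens, input_ids, input_mask, segment_ids = format_single_sequence(
    "the dog is hairy".split(), toy_vocab, max_seq_length=8)

print(tokens)
print(input_ids)
print(input_mask)
print(segment_ids)
```

All three numeric lists have exactly `max_seq_length` entries, which is what the `assert` statements in `convert_examples_to_features` check.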

In [None]:
def convert_examples_to_features(examples, max_seq_length, tokenizer):
    """Converts a list of `InputExample`s into a list of `InputFeatures`."""

    features = []
    for (ex_index, example) in enumerate(examples):
        print(example.text_a)
        tokens_a = tokenizer.tokenize(example.text_a)

        tokens_b = None
        if example.text_b:
            tokens_b = tokenizer.tokenize(example.text_b)
            # Modifies `tokens_a` and `tokens_b` in place so that the total
            # length is less than the specified length.
            # Account for [CLS], [SEP], [SEP] with "- 3"
            _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
        else:
            # Account for [CLS] and [SEP] with "- 2"
            if len(tokens_a) > max_seq_length - 2:
                tokens_a = tokens_a[:(max_seq_length - 2)]

        # The convention in BERT is:
        # (a) For sequence pairs:
        #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
        #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
        # (b) For single sequences:
        #  tokens:   [CLS] the dog is hairy . [SEP]
        #  type_ids: 0     0   0   0  0     0 0
        #
        # Where "type_ids" are used to indicate whether this is the first
        # sequence or the second sequence. The embedding vectors for `type=0` and
        # `type=1` were learned during pre-training and are added to the wordpiece
        # embedding vector (and position vector). This is not *strictly* necessary
        # since the [SEP] token unambiguously separates the sequences, but it makes
        # it easier for the model to learn the concept of sequences.
        #
        # For classification tasks, the first vector (corresponding to [CLS]) is
        # used as the "sentence vector". Note that this only makes sense because
        # the entire model is fine-tuned.
        tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
        segment_ids = [0] * len(tokens)

        if tokens_b:
            tokens += tokens_b + ["[SEP]"]
            segment_ids += [1] * (len(tokens_b) + 1)

        input_ids = tokenizer.convert_tokens_to_ids(tokens)

        # The mask has 1 for real tokens and 0 for padding tokens. Only real
        # tokens are attended to.
        input_mask = [1] * len(input_ids)

        # Zero-pad up to the sequence length.
        padding = [0] * (max_seq_length - len(input_ids))
        input_ids += padding
        input_mask += padding
        segment_ids += padding

        assert len(input_ids) == max_seq_length
        assert len(input_mask) == max_seq_length
        assert len(segment_ids) == max_seq_length
        
        labels_ids = []
        for label in example.labels:
            labels_ids.append(int(label))

        if ex_index < 0:
            logger.info("*** Example ***")
            logger.info("guid: %s" % (example.guid))
            logger.info("tokens: %s" % " ".join(
                    [str(x) for x in tokens]))
            logger.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
            logger.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
            logger.info(
                    "segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
            logger.info("label: %s (id = %s)" % (example.labels, labels_ids))

        features.append(
                InputFeatures(input_ids=input_ids,
                              input_mask=input_mask,
                              segment_ids=segment_ids,
                              label_ids=labels_ids))
    return features

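#### 3.1.1.5 Create the Model: create_model()

We take the pre-trained BERT model (BioBERT) and fine-tune it for our classification task.

In essence, we load the pre-trained weights, add a new output layer for the classification task, and train the model on our labeled data. The output layer is the only part initialized from scratch; the pre-trained weights are also updated, since the entire model is fine-tuned.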
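
For a multi-label task, each label is treated as an independent yes/no decision, which is why `create_model()` applies a sigmoid to every logit instead of a softmax across logits. The sketch below illustrates the difference in plain Python. The logits and labels are made-up values, and the loss uses the numerically stable form documented for `tf.nn.sigmoid_cross_entropy_with_logits`, namely `max(x, 0) - x*z + log(1 + exp(-|x|))`.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_cross_entropy(logit, label):
    # Numerically stable binary cross-entropy computed from the logit
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


logits = [2.0, -1.0, 0.5]  # made-up scores for three labels
labels = [1.0, 0.0, 1.0]   # two labels are active at the same time

probs_multilabel = [sigmoid(x) for x in logits]  # each in (0, 1), independent
probs_multiclass = softmax(logits)               # forced to sum to 1

per_label_loss = [sigmoid_cross_entropy(x, z) for x, z in zip(logits, labels)]
loss = sum(per_label_loss) / len(per_label_loss)

print("sigmoid :", [round(p, 3) for p in probs_multilabel])
print("softmax :", [round(p, 3) for p in probs_multiclass])
print("mean loss:", round(loss, 4))
```

With softmax the probabilities compete with each other, so two labels can never both be close to 1; with sigmoid they can, which is what a document tagged with several topics needs.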

In [None]:
def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels, use_one_hot_embeddings):
    """Creates a classification model."""
    model = modeling.BertModel(
        config=bert_config,
        is_training=is_training,
        input_ids=input_ids,
        input_mask=input_mask,
        token_type_ids=segment_ids,
        use_one_hot_embeddings=use_one_hot_embeddings)

    # In the demo, we are doing a simple classification task on the entire
    # segment.
    #
    # If you want to use the token-level output, use model.get_sequence_output()
    # instead.
    output_layer = model.get_pooled_output()
    print("\ntype(output_layer):", type(output_layer))
    print("output_layer:\n", output_layer)

    hidden_size = output_layer.shape[-1].value
    print("\ntype(hidden_size):", type(hidden_size))
    print("hidden_size:\n", hidden_size)

    output_weights = tf.get_variable(
        "output_weights", [num_labels, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))
    print("\ntype(output_weights):\n", type(output_weights))
    print("output_weights:\n", output_weights)

    output_bias = tf.get_variable(
        "output_bias", [num_labels], initializer=tf.zeros_initializer())

    with tf.variable_scope("loss"):
        if is_training:
            # I.e., 0.1 dropout
            output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
            print("\ntype(output_layer):", type(output_layer))
            print("output_layer:", output_layer)

        logits = tf.matmul(output_layer, output_weights, transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)
        print("\ntype(logits):", type(logits))
        print("logits:", logits)

        # probabilities = tf.nn.softmax(logits, axis=-1) ### multiclass case
        probabilities = tf.nn.sigmoid(logits)#### multi-label case
        
        labels = tf.cast(labels, tf.float32)
        tf.logging.info("\nnum_labels:{};\nlogits:{};\nlabels:{}".format(num_labels, logits, labels))
        per_example_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
        loss = tf.reduce_mean(per_example_loss)

        # probabilities = tf.nn.softmax(logits, axis=-1)
        # log_probs = tf.nn.log_softmax(logits, axis=-1)
        #
        # one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
        #
        # per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
        # loss = tf.reduce_mean(per_example_loss)

        return (loss, per_example_loss, logits, probabilities)

In [None]:
def model_fn_builder(bert_config, num_labels, init_checkpoint, learning_rate,
                     num_train_steps, num_warmup_steps, use_tpu,
                     use_one_hot_embeddings):
    """Returns `model_fn` closure for TPUEstimator."""

    def model_fn(features, labels, mode, params):  # pylint: disable=unused-argument
        """The `model_fn` for TPUEstimator."""

        #tf.logging.info("*** Features ***")
        #for name in sorted(features.keys()):
        #    tf.logging.info("  name = %s, shape = %s" % (name, features[name].shape))

        input_ids = features["input_ids"]
        input_mask = features["input_mask"]
        segment_ids = features["segment_ids"]
        label_ids = features["label_ids"]
        is_real_example = None
        if "is_real_example" in features:
             is_real_example = tf.cast(features["is_real_example"], dtype=tf.float32)
        else:
             is_real_example = tf.ones(tf.shape(label_ids), dtype=tf.float32)

        is_training = (mode == tf.estimator.ModeKeys.TRAIN)

        (total_loss, per_example_loss, logits, probabilities) = create_model(
            bert_config, is_training, input_ids, input_mask, segment_ids, label_ids,
            num_labels, use_one_hot_embeddings)

        tvars = tf.trainable_variables()
        initialized_variable_names = {}
        scaffold_fn = None
        if init_checkpoint:
            (assignment_map, initialized_variable_names
             ) = modeling.get_assignment_map_from_checkpoint(tvars, init_checkpoint)
            if use_tpu:

                def tpu_scaffold():
                    tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
                    return tf.train.Scaffold()

                scaffold_fn = tpu_scaffold
            else:
                tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

        tf.logging.info("**** Trainable Variables ****")
        for var in tvars:
            init_string = ""
            if var.name in initialized_variable_names:
                init_string = ", *INIT_FROM_CKPT*"
            #tf.logging.info("  name = %s, shape = %s%s", var.name, var.shape,init_string)

        output_spec = None
        if mode == tf.estimator.ModeKeys.TRAIN:

            train_op = optimization.create_optimizer(
                total_loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu)

            output_spec = tf.estimator.EstimatorSpec(
                mode=mode,
                loss=total_loss,
                train_op=train_op,
                scaffold=scaffold_fn)
        elif mode == tf.estimator.ModeKeys.EVAL:

            def metric_fn(per_example_loss, label_ids, probabilities, is_real_example):

                logits_split = tf.split(probabilities, num_labels, axis=-1)
                label_ids_split = tf.split(label_ids, num_labels, axis=-1)
                # metrics change to auc of every class
                eval_dict = {}
                for j, logits in enumerate(logits_split):
                    label_id_ = tf.cast(label_ids_split[j], dtype=tf.int32)
                    current_auc, update_op_auc = tf.metrics.auc(label_id_, logits)
                    eval_dict[str(j)] = (current_auc, update_op_auc)
                eval_dict['eval_loss'] = tf.metrics.mean(values=per_example_loss)
                return eval_dict

                ## original eval metrics
                # predictions = tf.argmax(logits, axis=-1, output_type=tf.int32)
                # accuracy = tf.metrics.accuracy(
                #     labels=label_ids, predictions=predictions, weights=is_real_example)
                # loss = tf.metrics.mean(values=per_example_loss, weights=is_real_example)
                # return {
                #     "eval_accuracy": accuracy,
                #     "eval_loss": loss,
                # }

            eval_metrics = metric_fn(per_example_loss, label_ids, probabilities, is_real_example)
            output_spec = tf.estimator.EstimatorSpec(
                mode=mode,
                loss=total_loss,
                eval_metric_ops=eval_metrics,
                scaffold=scaffold_fn)
        else:
            print("mode:", mode,"probabilities:", probabilities)
            output_spec = tf.estimator.EstimatorSpec(
                mode=mode,
                predictions={"probabilities": probabilities},
                scaffold=scaffold_fn)
        return output_spec

    return model_fn

#### 3.1.1.6 Miscellaneous Utility Classes and Functions

##### class `PaddingInputExample`

In [None]:
class PaddingInputExample(object):
    """Fake example so the num input examples is a multiple of the batch size.
    When running eval/predict on the TPU, we need to pad the number of examples
    to be a multiple of the batch size, because the TPU requires a fixed batch
    size. The alternative is to drop the last batch, which is bad because it means
    the entire output data won't be generated.
    We use this class instead of `None` because treating `None` as padding
    batches could cause silent errors.
    """

##### `convert_single_example`

In [None]:
def convert_single_example(ex_index, example, max_seq_length,
                           tokenizer):
    """Converts a single `InputExample` into a single `InputFeatures`."""
  
    if isinstance(example, PaddingInputExample):
        return InputFeatures(
            input_ids=[0] * max_seq_length,
            input_mask=[0] * max_seq_length,
            segment_ids=[0] * max_seq_length,
            label_ids=0,
            is_real_example=False)

    tokens_a = tokenizer.tokenize(example.text_a)
    tokens_b = None
    if example.text_b:
        tokens_b = tokenizer.tokenize(example.text_b)

    if tokens_b:
        # Modifies `tokens_a` and `tokens_b` in place so that the total
        # length is less than the specified length.
        # Account for [CLS], [SEP], [SEP] with "- 3"
        _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
    else:
        # Account for [CLS] and [SEP] with "- 2"
        if len(tokens_a) > max_seq_length - 2:
            tokens_a = tokens_a[0:(max_seq_length - 2)]

    # The convention in BERT is:
    # (a) For sequence pairs:
    #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
    #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
    # (b) For single sequences:
    #  tokens:   [CLS] the dog is hairy . [SEP]
    #  type_ids: 0     0   0   0  0     0 0
    #
    # Where "type_ids" are used to indicate whether this is the first
    # sequence or the second sequence. The embedding vectors for `type=0` and
    # `type=1` were learned during pre-training and are added to the wordpiece
    # embedding vector (and position vector). This is not *strictly* necessary
    # since the [SEP] token unambiguously separates the sequences, but it makes
    # it easier for the model to learn the concept of sequences.
    #
    # For classification tasks, the first vector (corresponding to [CLS]) is
    # used as the "sentence vector". Note that this only makes sense because
    # the entire model is fine-tuned.
    tokens = []
    segment_ids = []
    tokens.append("[CLS]")
    segment_ids.append(0)
    for token in tokens_a:
        tokens.append(token)
        segment_ids.append(0)
    tokens.append("[SEP]")
    segment_ids.append(0)

    if tokens_b:
        for token in tokens_b:
            tokens.append(token)
            segment_ids.append(1)
        tokens.append("[SEP]")
        segment_ids.append(1)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # The mask has 1 for real tokens and 0 for padding tokens. Only real
    # tokens are attended to.
    input_mask = [1] * len(input_ids)

    # Zero-pad up to the sequence length.
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length

    labels_ids = []
    for label in example.labels:
        labels_ids.append(int(label))


    feature = InputFeatures(
        input_ids=input_ids,
        input_mask=input_mask,
        segment_ids=segment_ids,
        label_ids=labels_ids,
        is_real_example=True)
    return feature

##### `file_based_convert_examples_to_features`

In [None]:
def file_based_convert_examples_to_features(
        examples, max_seq_length, tokenizer, output_file):
    """Convert a set of `InputExample`s to a TFRecord file."""

    writer = tf.python_io.TFRecordWriter(output_file)

    for (ex_index, example) in enumerate(examples):
        #if ex_index % 10000 == 0:
            #tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))

        feature = convert_single_example(ex_index, example,
                                         max_seq_length, tokenizer)

        def create_int_feature(values):
            f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
            return f

        features = collections.OrderedDict()
        features["input_ids"] = create_int_feature(feature.input_ids)
        features["input_mask"] = create_int_feature(feature.input_mask)
        features["segment_ids"] = create_int_feature(feature.segment_ids)
        features["is_real_example"] = create_int_feature(
            [int(feature.is_real_example)])
        if isinstance(feature.label_ids, list):
            label_ids = feature.label_ids
        else:
            label_ids = feature.label_ids[0]
        features["label_ids"] = create_int_feature(label_ids)

        tf_example = tf.train.Example(features=tf.train.Features(feature=features))
        writer.write(tf_example.SerializeToString())
    writer.close()

##### `file_based_input_fn_builder`

In [None]:
def file_based_input_fn_builder(input_file, seq_length, is_training,
                                drop_remainder, number_of_features):
    """Creates an `input_fn` closure to be passed to TPUEstimator."""

    name_to_features = {
        "input_ids": tf.FixedLenFeature([seq_length], tf.int64),
        "input_mask": tf.FixedLenFeature([seq_length], tf.int64),
        "segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
        "label_ids": tf.FixedLenFeature([number_of_features], tf.int64),
        "is_real_example": tf.FixedLenFeature([], tf.int64),
    }

    def _decode_record(record, name_to_features):
        """Decodes a record to a TensorFlow example."""
        example = tf.parse_single_example(record, name_to_features)

        # tf.Example only supports tf.int64, but the TPU only supports tf.int32.
        # So cast all int64 to int32.
        for name in list(example.keys()):
            t = example[name]
            if t.dtype == tf.int64:
                t = tf.to_int32(t)
            example[name] = t

        return example

    def input_fn(params):
        """The actual input function."""
        batch_size = params["batch_size"]

        # For training, we want a lot of parallel reading and shuffling.
        # For eval, we want no shuffling and parallel reading doesn't matter.
        d = tf.data.TFRecordDataset(input_file)
        if is_training:
            d = d.repeat()
            d = d.shuffle(buffer_size=100)

        d = d.apply(
            tf.contrib.data.map_and_batch(
                lambda record: _decode_record(record, name_to_features),
                batch_size=batch_size,
                drop_remainder=drop_remainder))

        return d

    return input_fn

##### `_truncate_seq_pair`
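
`_truncate_seq_pair` is only called for sequence pairs. It repeatedly drops the last token of whichever sequence is currently longer until the pair fits within the limit. The following stand-alone sketch shows that rule on made-up token lists; `truncate_pair_demo` is a local illustrative copy of the loop, named differently only to avoid clashing with the notebook function.

```python
def truncate_pair_demo(tokens_a, tokens_b, max_length):
    """Illustrative copy of the truncation loop; modifies both lists in place."""
    while len(tokens_a) + len(tokens_b) > max_length:
        # Always trim the longer sequence, one token at a time
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


tokens_a = ["a1", "a2", "a3", "a4", "a5", "a6"]
tokens_b = ["b1", "b2"]
truncate_pair_demo(tokens_a, tokens_b, max_length=5)

print(tokens_a)
print(tokens_b)
```

Only the long sequence loses tokens here; the short one is left intact, which is the motivation given in the function's own comment.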

In [None]:
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""

    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()

### 3.1.2 Create the Tokenizer

In [None]:
print(dict_bert_params['init_chkpnt'])
print(dict_bert_params['vocab'])

/content/drive/My Drive/BERT/models/biobert_v1.1_pubmed/model.ckpt-1000000
/content/drive/My Drive/BERT/models/biobert_v1.1_pubmed/vocab.txt


In [None]:
# 1) Validate that the casing setting matches the model checkpoint
tokenization.validate_case_matches_checkpoint(True, dict_bert_params['init_chkpnt'])

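# 2) Create the tokenizer
# Note: do_lower_case=True lowercases the text before WordPiece splitting.
# BioBERT v1.1 is distributed with the cased BERT-Base vocabulary, so
# do_lower_case=False may be the better match for this checkpoint.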
tokenizer = tokenization.FullTokenizer(vocab_file=dict_bert_params['vocab'], do_lower_case=True)

# 3) Test the tokenizer with a dummy sentence (in English)
tokenizer.tokenize("This here's an example of using the BERT tokenizer")




['this',
 'here',
 "'",
 's',
 'an',
 'example',
 'of',
 'using',
 'the',
 'be',
 '##rt',
 'token',
 '##izer']
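
The output above shows "bert" split into `be` and `##rt`. WordPiece does this with a greedy longest-match-first search over the vocabulary, where `##` marks a piece that continues a word. The sketch below reproduces that search for a single lowercase word; `toy_vocab` is a hypothetical five-entry vocabulary, not the BioBERT `vocab.txt`.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical vocabulary, for illustration only
toy_vocab = {"be", "##rt", "token", "##izer", "example"}

print(wordpiece("bert", toy_vocab))
print(wordpiece("tokenizer", toy_vocab))
print(wordpiece("covid", toy_vocab))
```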

In [None]:
# 4) Check the current size of the vocabulary
tokens_vocabulario_inicial = len(tokenizer.vocab)
print("Tokens Vocabulario Inicial:", tokens_vocabulario_inicial)

Tokens Vocabulario Inicial: 28996


## 3.2 Data Preparation

### 3.2.1 Split the Data into Training, Validation and Test Sets

In [None]:
# Split: 75% training and 25% test
# https://pythonhealthcare.org/2018/12/22/112-splitting-data-set-into-training-and-test-sets-using-pandas-dataframes-methods/
print("# registros Dataset Original:", len(df_covid), "\n")

# 1) Create the training dataframe (train)
train = df_covid.sample(frac=0.75, random_state=0)
#print("# registros Dataset Train:", len(train))

# 2) Create the test dataframe (test) from the rows not sampled into 'train'
x_test = df_covid.drop(train.index)
print("# registros Dataset Test:", len(x_test), "shape:", x_test.shape )

# 3) Create the validation dataframe (validation) by taking 10% of 'train'
x_validation = train.sample(frac=0.1, random_state=0)
print("# registros Dataset Validación:", len(x_validation), "shape:", x_validation.shape)

# 4) The remaining 90% of 'train' is the final training set
x_train = train.drop(x_validation.index)
print("# registros Dataset Train:", len(x_train), "shape:", x_train.shape)

# registros Dataset Original: 48510 

# registros Dataset Test: 12128 shape: (12128, 90)
# registros Dataset Validación: 3638 shape: (3638, 90)
# registros Dataset Train: 32744 shape: (32744, 90)
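
The sizes printed above can be checked by hand. The sketch below redoes the arithmetic, assuming that `DataFrame.sample(frac=...)` rounds `frac * len(df)` to the nearest integer; the variable names are illustrative.

```python
n_total = 48510                           # rows in df_covid

n_train_pool = round(0.75 * n_total)      # rows sampled into 'train'
n_test = n_total - n_train_pool           # rows left for the test set
n_validation = round(0.1 * n_train_pool)  # 10% of 'train'
n_train = n_train_pool - n_validation     # final training set

for name, n in [("train", n_train), ("validation", n_validation), ("test", n_test)]:
    print(f"{name:<10} {n:>6}  ({100 * n / n_total:.1f}% of the full dataset)")
```

Relative to the full dataset, the split is therefore roughly 67.5% training, 7.5% validation and 25% test.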


In [None]:
print(x_train.shape)
print(x_validation.shape)
print(x_test.shape)

(32744, 90)
(3638, 90)
(12128, 90)


### 3.2.2 Create Training Examples in BERT Format: `InputExample`

In [None]:
#def create_examples(df, idx_column_index, idx_column_text, idx_columns_labels):
print("ID", ID)

# Positional indices of the id, text and label columns
idx_column_index = df_covid.columns.get_loc(ID)
idx_column_text = df_covid.columns.get_loc(DATA_COLUMN)
idx_columns_labels = [df_covid.columns.get_loc(col) for col in LABEL_COLUMNS]

print("idx_column_index:", idx_column_index)
print("idx_column_text:", idx_column_text)
print("LABEL_COLUMNS:", LABEL_COLUMNS)
print("idx_columns_labels:", idx_columns_labels)

ID cord_uid
idx_column_index: 0
idx_column_text: 1
LABEL_COLUMNS: ['activation', 'activity', 'age', 'antibody', 'antigen', 'approach', 'background', 'base', 'binding', 'calf', 'cell_line', 'child', 'complex', 'conclusion', 'country', 'covid', 'day', 'de', 'detect', 'detection', 'development', 'diarrhea', 'effect', 'express', 'expression', 'gene', 'genome', 'global', 'health', 'increase', 'inhibit', 'interaction', 'le', 'mechanism', 'model', 'mouse', 'national', 'need', 'objective', 'patient', 'pig', 'population', 'public_health', 'que', 'replication', 'research', 'respiratory', 'review', 'role', 'sample', 'sarscov', 'sequence', 'severe_acute_respiratory', 'significantly', 'structure', 'surveillance', 'total', 'understand', 'vaccine', 'vitro']
idx_columns_labels: [29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85

In [None]:
##%%time
#def create_examples(df, index_column, text_column, label_columns):
train_examples = create_examples(x_train, idx_column_index, idx_column_text, idx_columns_labels)
len(train_examples)

32744

### 3.2.3 Training Parameters

In [None]:
# We'll set sequences to be at most 128 tokens long.
MAX_SEQ_LENGTH = 128

# Training hyperparameters
# These hyperparameters are copied from this colab notebook (https://colab.sandbox.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb)
BATCH_SIZE = 32
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 1.0
# Warmup is a period during which the learning rate
# is small and gradually increases; it usually helps training.
WARMUP_PROPORTION = 0.1
# Model configs
SAVE_CHECKPOINTS_STEPS = 1000
SAVE_SUMMARY_STEPS = 500

## 3.3 Train the Model (Train)

In [None]:
# Compute # train and warmup steps from batch size
num_train_steps = int(len(train_examples) / BATCH_SIZE * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)

In [None]:
# Path of the TFRecord file that will hold the training features
train_file = os.path.join(path_folder_train_output, "train.tf_record")

# Create the file if it does not exist
if not os.path.exists(train_file):
    open(train_file, 'w').close()

In [None]:
%%time
# Convert the examples to features and write them to the TFRecord file
file_based_convert_examples_to_features(
            train_examples, MAX_SEQ_LENGTH, tokenizer, train_file)
tf.logging.info("***** Running training *****")
tf.logging.info("  Num examples = %d", len(train_examples))
tf.logging.info("  Batch size = %d", BATCH_SIZE)
tf.logging.info("  Num steps = %d", num_train_steps)

INFO:tensorflow:***** Running training *****
INFO:tensorflow:  Num examples = 32744
INFO:tensorflow:  Batch size = 32
INFO:tensorflow:  Num steps = 1023
CPU times: user 2min 1s, sys: 150 ms, total: 2min 2s
Wall time: 2min 4s
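
The step count logged above follows directly from the dataset size and the hyperparameters. The sketch below recomputes it and also shows the learning-rate schedule, assuming the one implemented by `optimization.create_optimizer` in the BERT repository (linear warmup followed by linear decay to zero). `learning_rate_at` is a hypothetical helper written for this illustration.

```python
NUM_EXAMPLES = 32744
BATCH_SIZE = 32
NUM_TRAIN_EPOCHS = 1.0
WARMUP_PROPORTION = 0.1
LEARNING_RATE = 2e-5

num_train_steps = int(NUM_EXAMPLES / BATCH_SIZE * NUM_TRAIN_EPOCHS)
num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)


def learning_rate_at(step, init_lr, num_train_steps, num_warmup_steps):
    """Learning rate at a given step (assumed schedule: warmup, then linear decay)."""
    if step < num_warmup_steps:
        # Warmup: grow linearly from 0 towards init_lr
        return init_lr * step / num_warmup_steps
    # Afterwards: decay linearly so that the rate reaches 0 at the last step
    return init_lr * (1.0 - min(step, num_train_steps) / num_train_steps)


print("train steps :", num_train_steps)
print("warmup steps:", num_warmup_steps)
for step in (0, 51, 102, 512, 1023):
    lr = learning_rate_at(step, LEARNING_RATE, num_train_steps, num_warmup_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

Because `int()` truncates, the last partial batch (32744 is not a multiple of 32) does not add an extra step.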


In [None]:
# Creates an `input_fn` closure to be passed to TPUEstimator
train_input_fn = file_based_input_fn_builder(
    input_file=train_file,
    seq_length=MAX_SEQ_LENGTH,
    is_training=True,
    drop_remainder=True,
    number_of_features=NUM_OF_FEATURES)

### 3.3.1 TensorFlow Estimator Configuration

In [None]:
# Specify output directory and number of checkpoint steps to save
run_config = tf.estimator.RunConfig(
    model_dir= path_folder_model_tuned,
    save_summary_steps = SAVE_SUMMARY_STEPS,
    keep_checkpoint_max = 1,
    save_checkpoints_steps = SAVE_CHECKPOINTS_STEPS)

In [None]:
# Load the BERT configuration file
bert_config = modeling.BertConfig.from_json_file( dict_bert_params['config'] )

model_fn = model_fn_builder(
  bert_config= bert_config,
  num_labels= NUM_OF_FEATURES, #len(LABEL_COLUMNS)
  init_checkpoint= dict_bert_params['init_chkpnt'] ,
  learning_rate= LEARNING_RATE,
  num_train_steps= num_train_steps,
  num_warmup_steps= num_warmup_steps,
  use_tpu= False,
  use_one_hot_embeddings= False)

estimator = tf.estimator.Estimator(
  model_fn= model_fn,
  config= run_config,
  params={"batch_size": BATCH_SIZE})

INFO:tensorflow:Using config: {'_model_dir': '/content/drive/My Drive/BERT/models/fine-tuned/model/', '_tf_random_seed': None, '_save_summary_steps': 500, '_save_checkpoints_steps': 1000, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 1, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f8c5f680f60>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


### 3.3.2 Fine-tune the Model

In [None]:
print(f'Beginning Training!')
current_time = datetime.now()
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
print("Training took time ", datetime.now() - current_time)

Beginning Training!
INFO:tensorflow:Skipping training since max_steps has already saved.
Training took time  0:00:03.755697


## 3.4 Evaluate the Model (Validate)

In [None]:
%%time
eval_file = os.path.join(path_folder_eval_output, "eval.tf_record")
#filename = Path(train_file)
if not os.path.exists(eval_file):
    open(eval_file, 'w').close()

eval_examples = create_examples(x_validation, idx_column_index, idx_column_text, idx_columns_labels)
file_based_convert_examples_to_features(
    eval_examples, MAX_SEQ_LENGTH, tokenizer, eval_file)

CPU times: user 13.6 s, sys: 21.9 ms, total: 13.6 s
Wall time: 14.8 s


In [None]:
# This tells the estimator to run through the entire set.
eval_steps = None
eval_drop_remainder = False

eval_input_fn = file_based_input_fn_builder(
    input_file=eval_file,
    seq_length=MAX_SEQ_LENGTH,
    is_training=False,
    drop_remainder=eval_drop_remainder,
    number_of_features=NUM_OF_FEATURES)

print("Evaluate")
result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)

Evaluate
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
Instructions for updating:
Use `tf.data.Dataset.map(map_func, num_parallel_calls)` followed by `tf.data.Dataset.batch(batch_size, drop_remainder)`. Static tf.data optimizations will take care of using the fused implementation.
Instructions for updating:
Use `tf.cast` instead.
INFO:tensorflow:Calling model_fn.



Instructions for updating:
Use keras.layers.Dense instead.
Instructions for updating:
Please use `layer.__call__` method instead.

type(output_layer): <class 'tensorflow.python.framework.ops.Tensor'>
output_layer:
 Tensor("bert/pooler/dens

In [None]:
# Show the results of the Evaluation
output_eval_file = os.path.join(path_folder_eval_output, "eval_results.txt")
with tf.gfile.GFile(output_eval_file, "w") as writer:
    tf.logging.info("***** Eval results *****")
    for key in sorted(result.keys()):
        tf.logging.info("  %s = %s", key, str(result[key]))
        writer.write("%s = %s\n" % (key, str(result[key])))

INFO:tensorflow:***** Eval results *****
INFO:tensorflow:  0 = 0.96334314
INFO:tensorflow:  1 = 0.97171515
INFO:tensorflow:  10 = 0.60036504
INFO:tensorflow:  11 = 0.9630305
INFO:tensorflow:  12 = 0.78747934
INFO:tensorflow:  13 = 0.84463394
INFO:tensorflow:  14 = 0.7683457
INFO:tensorflow:  15 = 0.9156571
INFO:tensorflow:  16 = 0.59959704
INFO:tensorflow:  17 = 0.9762473
INFO:tensorflow:  18 = 0.9494256
INFO:tensorflow:  19 = 0.9440703
INFO:tensorflow:  2 = 0.48952392
INFO:tensorflow:  20 = 0.7793925
INFO:tensorflow:  21 = 0.9708494
INFO:tensorflow:  22 = 0.49743307
INFO:tensorflow:  23 = 0.9190171
INFO:tensorflow:  24 = 0.89971155
INFO:tensorflow:  25 = 0.96246517
INFO:tensorflow:  26 = 0.96388084
INFO:tensorflow:  27 = 0.84461254
INFO:tensorflow:  28 = 0.9744985
INFO:tensorflow:  29 = 0.52944237
INFO:tensorflow:  3 = 0.9504507
INFO:tensorflow:  30 = 0.97148883
INFO:tensorflow:  31 = 0.73147434
INFO:tensorflow:  32 = 0.9743613
INFO:tensorflow:  33 = 0.9640601
INFO:tensorflow:  34 = 0
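
The metric keys above are label positions stored as strings, so `sorted()` orders them lexicographically ("10" comes before "2"). The sketch below maps keys back to label names and sorts them numerically. It uses only the first four AUC values logged above and the first four entries of `LABEL_COLUMNS`. The `eval_loss` placeholder and the 0.6 cut-off for flagging weak labels are assumptions made for this illustration.

```python
# First four entries of LABEL_COLUMNS, in order
label_columns = ["activation", "activity", "age", "antibody"]

# Subset of the evaluation result shown above
result = {
    "0": 0.96334314,
    "1": 0.97171515,
    "2": 0.48952392,
    "3": 0.9504507,
    "eval_loss": float("nan"),  # placeholder: the real value is not shown above
}

# Keep only the per-label AUC entries and order them by label position
auc_keys = sorted((k for k in result if k.isdigit()), key=int)
auc_by_label = {label_columns[int(k)]: result[k] for k in auc_keys}

# An AUC near 0.5 means the label is ranked no better than chance
weak_labels = [name for name, auc in auc_by_label.items() if auc < 0.6]

for name, auc in auc_by_label.items():
    print(f"{name:<12} AUC = {auc:.4f}")
print("labels close to chance level:", weak_labels)
```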

## 3.5 Test the Model (Test)

### 3.5.1 Predict a Sample of Examples

In [None]:
# Print Current length of the Test Set
print(len(x_test))

# Testing only a small sample
# https://stackoverflow.com/questions/29576430/shuffle-dataframe-rows
x_test = x_test.sample(frac=0.2).reset_index(drop=True)

# Convert the data example instances of class InputExample
predict_examples = create_examples(x_test, idx_column_index, idx_column_text, idx_columns_labels)

12128
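
The call `x_test.sample(frac=0.2)` above draws a different subset on every run, so the figures reported later cannot be reproduced exactly. A small sketch of a seeded variant follows; the seed value `42` and the toy DataFrame are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy frame standing in for x_test
frame = pd.DataFrame({"cord_uid": ["id_%d" % i for i in range(10)]})

# Passing random_state fixes the draw, so repeated runs select the same rows
sample_a = frame.sample(frac=0.2, random_state=42).reset_index(drop=True)
sample_b = frame.sample(frac=0.2, random_state=42).reset_index(drop=True)

print(len(sample_a), sample_a.equals(sample_b))
```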


In [None]:
# Convert Examples to Features
test_features = convert_examples_to_features(
    predict_examples,
    MAX_SEQ_LENGTH,
    tokenizer)

In [None]:
def input_fn_builder(features, seq_length, is_training, drop_remainder):
  """Creates an `input_fn` closure to be passed to TPUEstimator."""

  all_input_ids = []
  all_input_mask = []
  all_segment_ids = []
  all_label_ids = []

  for feature in features:
    all_input_ids.append(feature.input_ids)
    all_input_mask.append(feature.input_mask)
    all_segment_ids.append(feature.segment_ids)
    all_label_ids.append(feature.label_ids)

  def input_fn(params):
    """The actual input function."""
    batch_size = params["batch_size"]

    num_examples = len(features)

    # This is for demo purposes and does NOT scale to large data sets. We do
    # not use Dataset.from_generator() because that uses tf.py_func which is
    # not TPU compatible. The right way to load data is with TFRecordReader.
    d = tf.data.Dataset.from_tensor_slices({
        "input_ids":
            tf.constant(
                all_input_ids, shape=[num_examples, seq_length],
                dtype=tf.int32),
        "input_mask":
            tf.constant(
                all_input_mask,
                shape=[num_examples, seq_length],
                dtype=tf.int32),
        "segment_ids":
            tf.constant(
                all_segment_ids,
                shape=[num_examples, seq_length],
                dtype=tf.int32),
        "label_ids":
            tf.constant(all_label_ids, shape=[num_examples, len(LABEL_COLUMNS)], dtype=tf.int32),
    })

    if is_training:
      d = d.repeat()
      d = d.shuffle(buffer_size=100)

    d = d.batch(batch_size=batch_size, drop_remainder=drop_remainder)
    return d

  return input_fn

In [None]:
%%time
print('Beginning Predictions!')
current_time = datetime.now()

predict_input_fn = input_fn_builder(features=test_features, seq_length=MAX_SEQ_LENGTH, is_training=False, drop_remainder=False)
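# NOTE: estimator.predict returns a lazy generator, so the time printed below
# excludes the actual inference, which runs when the predictions are consumed.
predictions = estimator.predict(predict_input_fn)
print("Prediction took time ", datetime.now() - current_time)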

Beginning Predictions!
Prediction took time  0:00:00.001317
CPU times: user 1.51 ms, sys: 0 ns, total: 1.51 ms
Wall time: 1.48 ms
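
The elapsed time of about one millisecond is consistent with `estimator.predict` returning a lazy generator: the call only builds the iterator, and the model runs once the results are consumed (in this notebook, inside `create_output`). A plain-Python sketch of the effect follows, where `fake_predict` is a hypothetical stand-in for the estimator and is not part of TensorFlow.

```python
import time

def fake_predict(n_examples, seconds_per_example=0.01):
    """Hypothetical stand-in for a lazy predict call: work happens only on iteration."""
    for i in range(n_examples):
        time.sleep(seconds_per_example)  # simulate the per-example inference cost
        yield {"probabilities": [0.5], "index": i}

start = time.perf_counter()
predictions = fake_predict(20)           # returns immediately; nothing has run yet
creation_time = time.perf_counter() - start

start = time.perf_counter()
materialized = list(predictions)         # the simulated inference happens here
consumption_time = time.perf_counter() - start

print("creation: %.4fs, consumption: %.4fs" % (creation_time, consumption_time))
```

To time the real inference, the code that consumes the generator would need to sit inside the timed region.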


In [None]:
def create_output(predictions):
    probabilities = []
    for (i, prediction) in enumerate(predictions):
        preds = prediction["probabilities"]
        probabilities.append(preds)
    dff = pd.DataFrame(probabilities)
    dff.columns = LABEL_COLUMNS
    
    return dff

In [None]:
path_folder_output_dataframes = path_folder_root + 'NLP/projects/bert/output/'
path_folder_output_dataframes

'/content/drive/My Drive/NLP/projects/bert/output/'

In [None]:
# Build a DataFrame with the predictions
output_df = create_output(predictions)
output_df.sample(3)

INFO:tensorflow:Calling model_fn.

type(output_layer): <class 'tensorflow.python.framework.ops.Tensor'>
output_layer:
 Tensor("bert/pooler/dense/Tanh:0", shape=(?, 768), dtype=float32)

type(hidden_size): <class 'int'>
hidden_size:
 768

type(output_weights):
 <class 'tensorflow.python.ops.variables.RefVariable'>
output_weights:
 <tf.Variable 'output_weights:0' shape=(60, 768) dtype=float32_ref>

type(logits): <class 'tensorflow.python.framework.ops.Tensor'>
logits: Tensor("loss/BiasAdd:0", shape=(?, 60), dtype=float32)
INFO:tensorflow:
num_labels:60;
logits:Tensor("loss/BiasAdd:0", shape=(?, 60), dtype=float32);
labels:Tensor("loss/Cast:0", shape=(?, 60), dtype=float32)
INFO:tensorflow:**** Trainable Variables ****
mode: infer probabilities: Tensor("loss/Sigmoid:0", shape=(?, 60), dtype=float32)
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /content/drive/My Drive/BERT/models/fine-tuned/model/model.ckpt-1023
INFO:

Unnamed: 0,activation,activity,age,antibody,antigen,approach,background,base,binding,calf,cell_line,child,complex,conclusion,country,covid,day,de,detect,detection,development,diarrhea,effect,express,expression,gene,genome,global,health,increase,inhibit,interaction,le,mechanism,model,mouse,national,need,objective,patient,pig,population,public_health,que,replication,research,respiratory,review,role,sample,sarscov,sequence,severe_acute_respiratory,significantly,structure,surveillance,total,understand,vaccine,vitro
1405,0.041591,0.10926,0.025052,0.046383,0.033486,0.028322,0.019834,0.036013,0.139845,0.087812,0.040006,0.049463,0.033714,0.023504,0.032044,0.055475,0.01414,0.014198,0.037706,0.042095,0.043863,0.114097,0.017213,0.040571,0.0562,0.438958,0.428051,0.043518,0.035001,0.035014,0.115531,0.037769,0.012798,0.060211,0.031389,0.046463,0.027652,0.030082,0.030493,0.035361,0.068078,0.036574,0.035385,0.013365,0.035915,0.022329,0.036405,0.044741,0.065566,0.045612,0.026918,0.46628,0.063731,0.026788,0.031047,0.030203,0.024833,0.051557,0.027521,0.052949
942,0.120588,0.123509,0.028187,0.133583,0.16195,0.028609,0.022093,0.028761,0.135795,0.138604,0.05035,0.043997,0.035669,0.025159,0.02095,0.0436,0.017941,0.022145,0.030287,0.047734,0.049557,0.136181,0.032763,0.14585,0.138025,0.074183,0.061698,0.036155,0.031009,0.030791,0.147415,0.035165,0.018388,0.124654,0.027663,0.13803,0.025215,0.032055,0.027865,0.029104,0.119084,0.031828,0.022492,0.011466,0.060252,0.022397,0.034016,0.042538,0.11544,0.031001,0.040533,0.055019,0.038724,0.029098,0.045538,0.02629,0.020308,0.044891,0.142554,0.047108
1841,0.045517,0.121265,0.02484,0.044935,0.033735,0.027962,0.019414,0.037384,0.151539,0.083295,0.040427,0.049053,0.034777,0.023272,0.032105,0.055671,0.014383,0.014278,0.036297,0.041366,0.044049,0.104777,0.017221,0.041872,0.055703,0.444396,0.428081,0.044775,0.033921,0.034308,0.127341,0.037962,0.012976,0.064073,0.031934,0.047638,0.026953,0.030479,0.030346,0.034076,0.063247,0.035908,0.035312,0.013286,0.035517,0.022349,0.034891,0.047031,0.069399,0.044435,0.026245,0.46694,0.060374,0.02637,0.031505,0.028839,0.024809,0.049349,0.027159,0.052755


### 3.5.2 Check prediction accuracy

In [None]:
# 1. Add the 'cord_uid' column to the predictions DataFrame
output_df = pd.concat([x_test['cord_uid'], output_df], axis=1)

# 2. Display 'output_df' indexed by 'cord_uid'
# (set_index returns a new DataFrame; 'output_df' keeps 'cord_uid' as a column, which step 6 relies on)
output_df.set_index('cord_uid')

cord_uid,activation,activity,age,antibody,antigen,approach,background,base,binding,calf,cell_line,child,complex,conclusion,country,covid,day,de,detect,detection,development,diarrhea,effect,express,expression,gene,genome,global,health,increase,inhibit,interaction,le,mechanism,model,mouse,national,need,objective,patient,pig,population,public_health,que,replication,research,respiratory,review,role,sample,sarscov,sequence,severe_acute_respiratory,significantly,structure,surveillance,total,understand,vaccine,vitro
ibwkkeud,0.028667,0.027395,0.023224,0.044958,0.033098,0.022080,0.032954,0.013139,0.027864,0.176477,0.030703,0.418374,0.016769,0.032697,0.020871,0.091439,0.025780,0.015459,0.065483,0.057762,0.024293,0.187056,0.028501,0.026803,0.036035,0.056498,0.056821,0.025014,0.045406,0.026315,0.026052,0.022519,0.014705,0.036504,0.017958,0.025156,0.029943,0.020735,0.032864,0.434070,0.208711,0.027325,0.030699,0.022605,0.023384,0.023339,0.378751,0.022428,0.029190,0.050150,0.074052,0.044608,0.075881,0.030168,0.024626,0.026151,0.019880,0.031329,0.037387,0.028950
xeaqfxqj,0.031622,0.034656,0.034714,0.014270,0.025053,0.069526,0.032488,0.064269,0.034312,0.030154,0.023169,0.033309,0.025140,0.034914,0.028059,0.076332,0.035276,0.018100,0.029244,0.023362,0.048411,0.024911,0.020315,0.026599,0.022781,0.035283,0.029257,0.094938,0.417678,0.015700,0.029135,0.026257,0.025072,0.040976,0.073902,0.029083,0.488574,0.095009,0.040243,0.036545,0.036948,0.029299,0.461766,0.020021,0.019074,0.143071,0.049306,0.054881,0.032436,0.025182,0.069045,0.024590,0.071571,0.019571,0.027153,0.034595,0.023888,0.044110,0.020255,0.022392
jwcgf3op,0.031740,0.058197,0.035980,0.017096,0.032305,0.149114,0.022616,0.144578,0.053514,0.054769,0.040866,0.028896,0.050778,0.037108,0.049245,0.065606,0.034802,0.019007,0.032973,0.022633,0.083712,0.036522,0.023969,0.032815,0.036137,0.073311,0.078480,0.105515,0.104124,0.024314,0.049374,0.053752,0.026354,0.031908,0.152984,0.033228,0.105098,0.112298,0.036444,0.022421,0.051372,0.042156,0.082446,0.023972,0.026365,0.112591,0.019507,0.101137,0.038805,0.027919,0.044274,0.072099,0.062128,0.030498,0.050538,0.049096,0.032893,0.093189,0.019133,0.027454
2h8yfriz,0.422114,0.120696,0.026155,0.038293,0.061696,0.022371,0.020204,0.023885,0.135023,0.094671,0.030747,0.035876,0.034473,0.018403,0.022015,0.065028,0.022227,0.017724,0.020309,0.030704,0.040903,0.089578,0.029620,0.088471,0.066863,0.060130,0.057775,0.027480,0.037521,0.029344,0.128517,0.028267,0.014021,0.402095,0.025927,0.081736,0.023447,0.031335,0.020003,0.032259,0.069239,0.034502,0.033964,0.008540,0.041718,0.024934,0.029741,0.043482,0.428025,0.028063,0.045402,0.044760,0.034346,0.025786,0.030872,0.019079,0.019356,0.032553,0.052100,0.025562
yz9yhnbx,0.078960,0.047074,0.031817,0.028189,0.032203,0.026648,0.040866,0.030593,0.045312,0.047382,0.024783,0.083727,0.021833,0.022293,0.025748,0.265174,0.026135,0.014316,0.034065,0.031791,0.022722,0.044324,0.032725,0.025346,0.028998,0.033580,0.037104,0.027040,0.083454,0.023276,0.041405,0.020224,0.015597,0.081360,0.033784,0.032933,0.058154,0.027203,0.018250,0.089469,0.046211,0.026701,0.090906,0.015312,0.026168,0.026969,0.096431,0.028088,0.080594,0.025568,0.321814,0.028325,0.254216,0.031538,0.018335,0.020674,0.021570,0.031888,0.031802,0.023979
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
kwikwxge,0.025401,0.024342,0.046884,0.031886,0.036740,0.086367,0.043152,0.100205,0.033800,0.068094,0.032854,0.044785,0.031866,0.040469,0.041602,0.284086,0.043751,0.017538,0.042516,0.031039,0.033869,0.061107,0.030733,0.021752,0.031646,0.043779,0.048261,0.047891,0.102448,0.030701,0.030012,0.038065,0.021656,0.029374,0.141926,0.027040,0.092238,0.059617,0.037885,0.044384,0.072103,0.041043,0.097543,0.023713,0.026036,0.048353,0.052555,0.040428,0.028612,0.026593,0.347442,0.046008,0.322206,0.046490,0.029372,0.031823,0.038471,0.051011,0.025200,0.031354
uznmbvom,0.038476,0.099524,0.030632,0.034564,0.035498,0.057381,0.020539,0.071617,0.105384,0.075409,0.041713,0.034644,0.046918,0.027436,0.034486,0.048266,0.017091,0.015509,0.039171,0.037551,0.063660,0.072040,0.016928,0.043230,0.053790,0.349606,0.375680,0.065441,0.039654,0.030657,0.091175,0.045166,0.016111,0.053700,0.068404,0.053290,0.036223,0.050875,0.033998,0.021949,0.057529,0.038790,0.038341,0.014936,0.036947,0.035089,0.022336,0.063851,0.059813,0.039550,0.025163,0.384988,0.056649,0.028686,0.040190,0.034352,0.027581,0.063193,0.025094,0.046353
e3zx2kh5,0.407488,0.208614,0.019842,0.023646,0.038713,0.021210,0.015643,0.024563,0.224348,0.046965,0.024613,0.025459,0.030425,0.013605,0.019392,0.058503,0.018359,0.013780,0.015165,0.021001,0.033749,0.046721,0.024560,0.060282,0.044387,0.053567,0.048548,0.024818,0.034688,0.024203,0.215641,0.022753,0.010248,0.396061,0.023182,0.055573,0.027013,0.027214,0.016064,0.025348,0.038488,0.024786,0.038241,0.006880,0.033173,0.024790,0.022037,0.040448,0.431774,0.020704,0.040193,0.039901,0.033855,0.017466,0.027061,0.015767,0.019572,0.027575,0.032675,0.020129
99q1bhba,0.322154,0.260354,0.017034,0.023164,0.036604,0.017511,0.014398,0.022295,0.264370,0.044094,0.023943,0.023461,0.029588,0.012370,0.018057,0.047451,0.017136,0.012865,0.015253,0.019307,0.032827,0.048885,0.022740,0.057086,0.046284,0.052824,0.056590,0.024075,0.025389,0.022675,0.265787,0.022887,0.009638,0.302947,0.020468,0.055049,0.021600,0.024383,0.014684,0.023592,0.038412,0.021314,0.029068,0.006823,0.033861,0.021422,0.019690,0.037892,0.347338,0.019967,0.033686,0.041852,0.031453,0.015452,0.027819,0.014654,0.017950,0.026743,0.032769,0.020329


In [None]:
# 3. Determine which columns to drop

# a) Get all the columns of the DataFrame
cols = x_test.columns.values.tolist()

# b) Specify the columns to keep
# NOTE: columns whose names start with 'word_' are kept dynamically
cols_top_n_words = [c for c in cols if c.startswith('word_')]
cols_conservar = ['cord_uid','document','clean','clean_tfidf','best_topic','best_topic_score'] + cols_top_n_words
print(cols_conservar)

# c) Get the columns to drop
cols_eliminar = [c for c in cols if c not in cols_conservar]
print(cols_eliminar)

['cord_uid', 'document', 'clean', 'clean_tfidf', 'best_topic', 'best_topic_score', 'word_0', 'word_1', 'word_2']
[]


In [None]:
# 4. Drop the columns
x_test.drop(cols_eliminar, axis=1, inplace=True)

# 5. Display 'x_test' indexed by 'cord_uid' (not in place; 'x_test' keeps the column)
x_test.set_index('cord_uid')

cord_uid,document,clean,clean_tfidf,best_topic,best_topic_score,word_0,word_1,word_2
ibwkkeud,6 Gastroenteritis. Publisher Summary Acute gas...,"[[gastroenteritis], [publisher_summary, acute_...","[gastroenteritis, publisher_summary, acute_gas...",topic_5,-3.739231e-12,diarrhea,pig,calf
xeaqfxqj,Society–Space. Our conception of society–space...,"[[societyspace], [conception, societyspace, de...","[determines, point, transform, social, world, ...",topic_0,-2.012044e-08,health,public_health,national
jwcgf3op,Accessible areas in ecological niche compariso...,"[[accessible, area, ecological, niche, compari...","[accessible, area, ecological, niche, comparis...",topic_13,-2.714867e-06,research,need,global
2h8yfriz,Persistent Foot-and-Mouth Disease Virus Infect...,"[[persistent, footandmouth_disease, virus, inf...","[persistent, footandmouth_disease, nasopharynx...",topic_19,-2.622129e-08,replication,vitro,cell_line
yz9yhnbx,The SARS-CoV-2 receptor ACE2 expression of mat...,"[[sarscov, receptor, ace, expression, maternal...","[sarscov, receptor, ace, expression, interface...",topic_8,-6.039613e-14,severe_acute_respiratory,sarscov,covid
...,...,...,...,...,...,...,...,...
kwikwxge,Epidemiological characteristics and transmissi...,"[[epidemiological, characteristic, transmissib...","[epidemiological, characteristic, transmissibi...",topic_8,-6.181722e-13,severe_acute_respiratory,sarscov,covid
uznmbvom,Sex in a test tube: testing the benefits of in...,"[[sex, test, tube, test, benefit, vitro, recom...","[sex, test, tube, test, benefit, vitro, recomb...",topic_4,-8.304468e-12,sequence,genome,gene
e3zx2kh5,Angiotensin converting enzyme 2 in the brain: ...,"[[angiotensin_convert, enzyme, brain, property...","[angiotensin_convert, enzyme, brain, property,...",topic_3,0.000000e+00,activation,mechanism,role
99q1bhba,Pharmacological and Biological Antiviral Thera...,"[[pharmacological, biological, antiviral, ther...","[pharmacological, biological, antiviral, thera...",topic_2,0.000000e+00,binding,activity,inhibit


In [None]:
# 6. Determine the columns that make up the classification labels (features)
labels = output_df.columns.values.tolist()
labels.remove('cord_uid')
print(labels)

['activation', 'activity', 'age', 'antibody', 'antigen', 'approach', 'background', 'base', 'binding', 'calf', 'cell_line', 'child', 'complex', 'conclusion', 'country', 'covid', 'day', 'de', 'detect', 'detection', 'development', 'diarrhea', 'effect', 'express', 'expression', 'gene', 'genome', 'global', 'health', 'increase', 'inhibit', 'interaction', 'le', 'mechanism', 'model', 'mouse', 'national', 'need', 'objective', 'patient', 'pig', 'population', 'public_health', 'que', 'replication', 'research', 'respiratory', 'review', 'role', 'sample', 'sarscov', 'sequence', 'severe_acute_respiratory', 'significantly', 'structure', 'surveillance', 'total', 'understand', 'vaccine', 'vitro']


In [None]:
# 7. Create a column with the top N words predicted by the model
x_test['top_n_predicted_words'] = output_df[labels].apply(lambda s: s.nlargest(top_n_words).index.tolist(), axis=1)
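
As an illustration of the row-wise `nlargest` pattern used in step 7, here is a small sketch on a toy DataFrame. The probabilities, the document ids and the value `top_n = 2` are hypothetical; the notebook itself uses `top_n_words` and the full set of label columns.

```python
import pandas as pd

top_n = 2  # hypothetical; stands in for top_n_words

# Toy probability table: one row per document, one column per label
probs = pd.DataFrame(
    {
        "gene": [0.35, 0.05],
        "genome": [0.38, 0.06],
        "patient": [0.02, 0.43],
        "child": [0.03, 0.42],
    },
    index=["doc_a", "doc_b"],
)

# For each row, keep the names of the top_n labels with the highest probability,
# ordered from most to least probable
top_labels = probs.apply(lambda s: s.nlargest(top_n).index.tolist(), axis=1)
print(top_labels.to_dict())
```

Because `nlargest` orders by probability, the position of each predicted word reflects the model's ranking, not the order of the topic words it is later compared against.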

In [None]:
# 8. Create auxiliary columns to measure the accuracy of the model's predictions
 
# a) Create columns for the top N predicted words
cols_predicted_words = []
cols_accuracy_predicted_words = []

for i in range(top_n_words):
  # Column with the predicted word
  col_name = 'predicted_word_' + str(i)
  cols_predicted_words.append(col_name)
  x_test[col_name] = ""
  # Column with the accuracy of the predicted word
  col_name = 'accuracy_predicted_word_' + str(i)
  cols_accuracy_predicted_words.append(col_name)
  x_test[col_name] = 0

# b) Create a column for the accuracy of the prediction
x_test['predicted_words_accuracy'] = 0.0

# c) Create a column with the total number of occurrences of predicted words
x_test['predicted_words_occurrences'] = 0

In [None]:
# 9. Define a function that updates the information about the predicted words
def update_predicted_words(row, cols_predicted_words, cols_accuracy_predicted_words):
  row_id = row['cord_uid']
  #print(f"\nrow id: {row_id}")
  try:
    list_series_labels = ['cord_uid']
    list_series_values = [row_id]    
    list_predicted_words = list(row['top_n_predicted_words'])
    #print("list_predicted_words",list_predicted_words)
    
    len_predicted_words = len(cols_predicted_words)
    topic_words = []  
    accuracy_predictions = []

    for i in range(len_predicted_words):      
      col_name = cols_predicted_words[i]
      #print(f"predicted col_name={col_name}")

      predicted_word = list_predicted_words[i]
      topic_word = row['word_' + str(i)]
      #print("predicted_word:", predicted_word)
      #print("topic_word:", topic_word)
      #print(f"[{i}]: predicted_word={predicted_word}, topic_word={topic_word}")

      # Append the topic word to the topic_words list
      topic_words.append(topic_word)

      # Set predicted-word column N to the corresponding word
      row[col_name] = predicted_word
      list_series_labels.append(col_name)
      list_series_values.append(predicted_word)
      
      # Update the accuracy of prediction N (1 only if the word matches at the same position)
      col_name = cols_accuracy_predicted_words[i]
      accuracy_predicted_word = 1 if topic_word == predicted_word else 0 
      accuracy_predictions.append(accuracy_predicted_word)

      list_series_labels.append(col_name)
      list_series_values.append(accuracy_predicted_word)

    # Count the predicted words that appear among the topic words (order-independent)
    num_occurrences = len(set(list_predicted_words) & set(topic_words))  
    list_series_labels.append('predicted_words_occurrences')
    list_series_values.append(num_occurrences)

    # Compute the fraction of position-wise hits among the predicted words (accuracy)
    accuracy = sum(accuracy_predictions) / len(accuracy_predictions)

    #row['predicted_words_accuracy'] = accuracy
    list_series_labels.append('predicted_words_accuracy')
    list_series_values.append(accuracy)

    # Pandas Dataframe: How to update multiple columns by applying a function?
    # https://stackoverflow.com/questions/32603051/pandas-dataframe-how-to-update-multiple-columns-by-applying-a-function

    # https://www.geeksforgeeks.org/creating-a-pandas-series-from-lists/
    updated_series = pd.Series(list_series_values, index =list_series_labels)
    return updated_series
  except Exception as error:
    print("Error ", error.__class__, "in row id", row_id, "occurred.")
    print(error)
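
The function above records two different notions of a hit: a position-wise match (`accuracy_predicted_word_N`) and an order-independent overlap (`predicted_words_occurrences`). A condensed sketch of both, using the topic words and predicted words reported for document `xeaqfxqj`; the helper names `positional_accuracy` and `overlap_count` are hypothetical.

```python
def positional_accuracy(predicted, expected):
    """Fraction of positions i where predicted[i] == expected[i]."""
    hits = [1 if p == e else 0 for p, e in zip(predicted, expected)]
    return sum(hits) / len(hits)

def overlap_count(predicted, expected):
    """Number of labels the two lists have in common, ignoring order."""
    return len(set(predicted) & set(expected))

expected = ["health", "public_health", "national"]    # word_0, word_1, word_2
predicted = ["national", "public_health", "health"]   # top_n_predicted_words

print(positional_accuracy(predicted, expected))  # only the middle position matches
print(overlap_count(predicted, expected))        # all three words are shared
```

The same three words score one third under the first measure and a full overlap under the second, which is why the two averages reported later differ so much.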

In [None]:
# 10. Create a temporary DataFrame with the updated information about the predicted words
df = x_test.apply(lambda row: update_predicted_words(row, cols_predicted_words, cols_accuracy_predicted_words), axis=1)
df.set_index('cord_uid')

cord_uid,predicted_word_0,accuracy_predicted_word_0,predicted_word_1,accuracy_predicted_word_1,predicted_word_2,accuracy_predicted_word_2,predicted_words_occurrences,predicted_words_accuracy
ibwkkeud,patient,0,child,0,respiratory,0,0,0.000000
xeaqfxqj,national,0,public_health,1,health,0,3,0.333333
jwcgf3op,model,0,approach,0,base,0,0,0.000000
2h8yfriz,role,0,activation,0,mechanism,0,0,0.000000
yz9yhnbx,sarscov,0,covid,0,severe_acute_respiratory,0,3,0.000000
...,...,...,...,...,...,...,...,...
kwikwxge,sarscov,0,severe_acute_respiratory,0,covid,1,3,0.333333
uznmbvom,sequence,1,genome,1,gene,1,3,1.000000
e3zx2kh5,role,0,activation,0,mechanism,0,3,0.000000
99q1bhba,role,0,activation,0,mechanism,0,0,0.000000


In [None]:
# 11. Determine the columns to update in the main DataFrame
cols_to_update = cols_predicted_words + cols_accuracy_predicted_words
cols_to_update.append('predicted_words_occurrences')
cols_to_update.append('predicted_words_accuracy')

cols_to_update

['predicted_word_0',
 'predicted_word_1',
 'predicted_word_2',
 'accuracy_predicted_word_0',
 'accuracy_predicted_word_1',
 'accuracy_predicted_word_2',
 'predicted_words_occurrences',
 'predicted_words_accuracy']

In [None]:
# 12. Update the main (testing) DataFrame
x_test[cols_to_update] = df[cols_to_update]
x_test.sample(3)

Unnamed: 0,cord_uid,document,clean,clean_tfidf,best_topic,best_topic_score,word_0,word_1,word_2,top_n_predicted_words,predicted_word_0,accuracy_predicted_word_0,predicted_word_1,accuracy_predicted_word_1,predicted_word_2,accuracy_predicted_word_2,predicted_words_accuracy,predicted_words_occurrences
498,vxihrg6q,Kawasaki disease may be a hyperimmune reaction...,"[[kawasaki, disease, may, hyperimmune, reactio...","[kawasaki, hyperimmune, reaction, genetically,...",topic_1,-0.000168,patient,respiratory,child,"[patient, respiratory, child]",patient,1,respiratory,1,child,1,1.0,3
552,9b11c0od,Twenty-five years of type I interferon-based t...,"[[twentyfive, year, type, interferonbased, tre...","[twentyfive, year, type, treatment, critical, ...",topic_3,-0.001842,activation,mechanism,role,"[role, activation, mechanism]",role,0,activation,0,mechanism,0,0.0,3
1606,krhb0wzf,Enterovirus D68 detection in respiratory speci...,"[[enterovirus, detection, respiratory, specime...","[enterovirus, detection, respiratory, specimen...",topic_1,0.0,patient,respiratory,child,"[child, patient, sample]",child,0,patient,0,sample,0,0.0,2


In [None]:
# 13. Compute summary statistics
average_accuracy_pct = x_test['predicted_words_accuracy'].mean() * 100
average_predicted_words_occurrence = x_test['predicted_words_occurrences'].mean()
average_predicted_words_occurrence_pct = average_predicted_words_occurrence / top_n_words * 100

print("Average percentage accuracy: {:.2f}%".format(average_accuracy_pct))
print("Average number of occurrences of predicted words: {:.2f}".format(average_predicted_words_occurrence))
print("Average percentage of occurrence of predicted words: {:.2f}%".format(average_predicted_words_occurrence_pct))

Average percentage accuracy: 15.11%
Average number of occurrences of predicted words: 1.83
Average percentage of occurrence of predicted words: 60.84%


# 4. Conclusions

1. The pre-trained BioBERT model was fine-tuned for the multi-label classification task.  

2. The weights of the trained model were obtained and saved; they will be used in the project's third notebook, with BERT as a Service.  

3. The *average percentage accuracy* obtained (scoring `1` when a predicted label matches the assigned label at the same position and `0` otherwise) is `15.11%`, which is very low.

4. In contrast, the *number of occurrences of the predicted words* within the set of labels assigned to each record averages `1.83` (out of the `3` used in this analysis).  

5. Put another way, the set of predicted words overlaps by `60.84%` with the labels of the document's main topic, although not necessarily in the same order.

6. The following tasks remain pending:  
    a) Review whether the way the accuracy of the predicted labels is measured is appropriate.  
    b) Regardless of the previous point, determine whether multi-label classification is the best fit for this case (for example, it could be more effective and convenient to classify by a single class/label, the most representative one for a topic).  
    c) Carry out the work needed to extend the model's vocabulary (the `vocab.txt` file). 
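
For item 6a, one order-independent alternative is to threshold the predicted probabilities and compute micro-averaged precision, recall and F1 over all labels. The sketch below is only one possible option, not the measurement used in this notebook. The helper name `micro_precision_recall_f1`, the toy probabilities and the threshold values are all assumptions.

```python
def micro_precision_recall_f1(probabilities, true_labels, threshold):
    """probabilities: list of {label: prob}; true_labels: list of sets of labels."""
    tp = fp = fn = 0
    for probs, truth in zip(probabilities, true_labels):
        predicted = {label for label, p in probs.items() if p >= threshold}
        tp += len(predicted & truth)   # predicted and assigned
        fp += len(predicted - truth)   # predicted but not assigned
        fn += len(truth - predicted)   # assigned but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy data with hypothetical probabilities and label sets
probabilities = [
    {"gene": 0.35, "genome": 0.38, "sequence": 0.38, "role": 0.06},
    {"gene": 0.05, "genome": 0.06, "sequence": 0.31, "role": 0.43},
]
true_labels = [{"gene", "genome", "sequence"}, {"role", "gene"}]

print(micro_precision_recall_f1(probabilities, true_labels, threshold=0.3))
print(micro_precision_recall_f1(probabilities, true_labels, threshold=0.5))
```

In the prediction rows displayed earlier, no probability reaches `0.5`, so the threshold would have to be tuned on held-out data; a cutoff of `0.5` would select no labels for those rows.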