# Transformers and BERT

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


First, we import BertForMaskedLM to predict masked tokens and BertTokenizer to tokenize the text.

In [5]:
from transformers import BertTokenizer, BertForMaskedLM
from torch.nn import functional as F
import torch

Next, we load the tokenizer and then the model. Calling the tokenizer on a string returns the input IDs, token type IDs, and attention mask that the model expects.

In [6]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer('Hello world')

{'input_ids': [101, 7592, 2088, 102], 'token_type_ids': [0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1]}

In [7]:
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict = True)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Next, to predict the missing word, we give the model an input sentence containing a mask token, and the model fills in that position. Before anything else, we have to preprocess (tokenize) the sentence we pass to the model.


'pt' stands for PyTorch tensors: passing return_tensors = "pt" makes the tokenizer return tensors instead of Python lists.

With encode we go from text to numbers (token IDs), and with decode we go from numbers back to text.
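
To make the encode/decode round trip concrete, here is a minimal sketch using a toy four-entry vocabulary. The IDs are taken from the tokenizer('Hello world') output above. The `toy_encode` and `toy_decode` helpers are hypothetical stand-ins, not part of the transformers API, and they skip WordPiece sub-word splitting entirely.

```python
# Toy vocabulary: the IDs match the tokenizer('Hello world') output above.
TOY_VOCAB = {"[CLS]": 101, "[SEP]": 102, "hello": 7592, "world": 2088}
TOY_IDS_TO_TOKENS = {i: t for t, i in TOY_VOCAB.items()}


def toy_encode(text):
    """Lowercase, split on whitespace, and wrap the IDs in [CLS] ... [SEP]."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    return [TOY_VOCAB[token] for token in tokens]


def toy_decode(ids, skip_special_tokens=True):
    """Map IDs back to tokens, optionally dropping [CLS] and [SEP]."""
    tokens = [TOY_IDS_TO_TOKENS[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in ("[CLS]", "[SEP]")]
    return " ".join(tokens)


ids = toy_encode("Hello world")
print(ids)              # [101, 7592, 2088, 102]
print(toy_decode(ids))  # hello world
```

The decoded text comes back in lowercase because the uncased model lowercases its input, so the round trip is not exact.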

**ENCODE**

We set truncation to True for long inputs because bert-base-uncased can only handle 512 tokens at a time.
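
What truncation does can be sketched in plain Python. This is an illustration under assumptions, not the tokenizer's actual implementation: `truncate_ids` is a hypothetical helper, and it assumes the sequence starts with [CLS] (ID 101) and ends with [SEP] (ID 102), both of which should survive the cut.

```python
MAX_LEN = 512  # maximum sequence length for bert-base-uncased


def truncate_ids(ids, max_len=MAX_LEN):
    """Keep the leading [CLS] and trailing [SEP], dropping tokens from the end of the body."""
    if len(ids) <= max_len:
        return list(ids)
    # Reserve one slot for the final [SEP]; keep the first max_len - 1 IDs.
    return list(ids[: max_len - 1]) + [ids[-1]]


# A fake 600-token sequence: [CLS] + 598 copies of an arbitrary ID + [SEP].
long_ids = [101] + [2088] * 598 + [102]
short_ids = truncate_ids(long_ids)
print(len(long_ids), "->", len(short_ids))  # 600 -> 512
```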

We can also use the encode_plus function, which returns more information: the token type IDs and the attention mask in addition to the input IDs.

In [53]:
text = "Every Monday, Mary goes to the " + tokenizer.mask_token + " to relax."
# Tokenize the sentence and return PyTorch tensors
input = tokenizer.encode_plus(text, return_tensors = "pt")
# Position(s) of the [MASK] token in the input IDs
mask_index = torch.where(input["input_ids"][0] == tokenizer.mask_token_id)
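
The torch.where call above locates the position of the mask token in the ID sequence. The same lookup can be sketched on a plain list of token strings. The token list below is an assumption: it supposes every word in the sentence maps to a single WordPiece token.

```python
# Assumed tokenization of the example sentence (one token per word).
tokens = ["[CLS]", "every", "monday", ",", "mary", "goes", "to", "the",
          "[MASK]", "to", "relax", ".", "[SEP]"]

# Collect every position holding the mask token, as torch.where would.
mask_positions = [i for i, token in enumerate(tokens) if token == "[MASK]"]
print(mask_positions)  # [8]
```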

Finally, we run the model on the tokenized input and read off its predictions for the masked position.

In [54]:
output = model(**input)

In [55]:
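logits = output.logits
# Turn the logits into probabilities over the vocabulary
softmax = F.softmax(logits, dim = -1)
# Keep only the probabilities at the masked position
mask_word = softmax[0, mask_index, :]
# Token IDs of the 10 most likely replacements
top_10 = torch.topk(mask_word, 10, dim = 1)[1][0]
for token in top_10:
    # Decode each ID and substitute it for the mask token
    word = tokenizer.decode([token])
    new_sentence = text.replace(tokenizer.mask_token, word)
    print(new_sentence)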

Every Monday, Mary goes to the beach to relax.
Every Monday, Mary goes to the spa to relax.
Every Monday, Mary goes to the hospital to relax.
Every Monday, Mary goes to the gym to relax.
Every Monday, Mary goes to the pool to relax.
Every Monday, Mary goes to the library to relax.
Every Monday, Mary goes to the hotel to relax.
Every Monday, Mary goes to the bathroom to relax.
Every Monday, Mary goes to the park to relax.
Every Monday, Mary goes to the house to relax.
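
The softmax and top-k steps can be illustrated without the model. The sketch below uses made-up logits over a five-word toy vocabulary (the numbers are invented for illustration, not taken from BERT) and plain-Python stand-ins for F.softmax and torch.topk.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(values, k):
    """Return the indices of the k largest values, largest first."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]


# Invented logits for the masked position over a toy vocabulary.
toy_vocab = ["beach", "spa", "moon", "gym", "table"]
toy_logits = [4.0, 3.5, -1.0, 3.0, 0.5]

probs = softmax(toy_logits)
for index in top_k(probs, 3):
    print(f"Every Monday, Mary goes to the {toy_vocab[index]} to relax.")
```

Because softmax preserves ordering, taking the top k of the raw logits selects the same tokens. The softmax is only needed when the probabilities themselves are reported.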
