<h1>Transformers Examples</h1>

- As we saw, transformers are useful for natural language processing. 🤗 Hugging Face is a company and a highly influential community in natural language processing (NLP) and machine learning. Its main mission is to make deep learning technology accessible and useful to everyone by providing tools and pretrained models that simplify the development of artificial intelligence applications.

- Hugging Face's main contribution to the machine learning community is its Transformers library. It provides an easy-to-use interface for working with pretrained Transformer models such as BERT, GPT, T5, DistilBERT, and many others, which are designed for natural language processing tasks.

- With Transformers, you can load pretrained models and use them for a variety of tasks, such as:
  * Text classification.
  * Language translation.
  * Text generation.
  * Text summarization.
  * Sentiment analysis.
  * Question answering.


<h2>Installing transformers</h2>

In [5]:
%%capture --no-display
pip install transformers

In [6]:
%%capture --no-display
pip install ipywidgets --upgrade

In [7]:
%%capture --no-display
pip install tqdm --upgrade

<h2>Text classification and sentiment analysis example</h2>

<h3>Import the model and tokenizer, and choose what to predict</h3>

In [8]:
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification  # Import the tokenizer and the pretrained model
# This model classifies an English sentence as expressing a positive or a negative sentiment

# Create the tokenizer and the model
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

In [9]:
# Encode the sentence to classify. return_tensors selects the type of tensor returned, here a PyTorch tensor
inputs = tokenizer("I am so happy right now", return_tensors="pt")

In [41]:
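# Note: this cell was run after the question-answering example (execution count 41), so the output below
# shows that example's question and context rather than "I am so happy right now"; it is also cut off.
inputs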

{'input_ids': tensor([[  101,  2054,  2003, 17662,  2227,  2124,  2005,  1029,   102, 17662,
          2227,  2003,  1037,  2194,  2008,  3640,  2019,  2330,  1011,  3120,
          4132,  2005,  3019,  2653,  6364,  1006, 17953,  2361,  1007,  8518,
          1012,  2037,  4132,  4107,  3653,  1011,  4738,  4275,  1010,  2951,
         13462,  2015,  1010,  1998,  5906,  2000,  2191,  2009,  6082,  2005,
          9797,  2000,  3857, 17953,  2361,  5097,  1012,  2027,  2024,  2092,
          2124,  2005,  2037, 10938,  2121,  4275,  2066, 14324,  1010, 14246,
          2102,  1011,  1016,  1010,  1056,  2629,  1010,  1998,  2500,  1010,
          2029,  2031,  2042,  4235,  4233,  1999,  1996,  9932,  2451,  1012,
           102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

<h3>Predict and inspect the output</h3>

In [10]:
# By default PyTorch records operations so gradients can be computed later; torch.no_grad() turns that off,
# which saves memory and time during inference
with torch.no_grad():
    outputs = model(**inputs)  # **inputs unpacks the dictionary so each key/value pair is passed as a keyword argument
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-4.3311,  4.6930]]), hidden_states=None, attentions=None)
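
The `**inputs` call style can be tried in isolation. The sketch below uses a toy function, `forward`, as a hypothetical stand-in for the model's forward method; its parameter names only mirror the keys a tokenizer typically returns.

```python
# Toy stand-in for a model's forward method (hypothetical, for illustration only).
def forward(input_ids, attention_mask=None):
    return {"n_tokens": len(input_ids), "has_mask": attention_mask is not None}

inputs = {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1]}

# These two calls are equivalent: ** spreads the dict into keyword arguments.
explicit = forward(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
unpacked = forward(**inputs)
print(explicit == unpacked)
```

Because the dictionary keys become argument names, a key the function does not accept would raise a `TypeError`.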

In [11]:
with torch.no_grad():
    logits = model(**inputs).logits  # .logits holds the raw scores: index 0 is the NEGATIVE score,
                                     # index 1 the POSITIVE score
predicted_class_id = logits.argmax().item()  # Pick the class with the highest score
model.config.id2label[predicted_class_id]  # Map the class id to its label through the model's dictionary

'POSITIVE'

In [12]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
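
Logits are unnormalized scores, so it can help to turn them into probabilities with a softmax. The sketch below does this in plain Python for the two logits printed earlier; the `softmax` helper is written here for illustration and is not part of the notebook.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-4.3311, 4.6930]                 # values printed by the prediction cell
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(id2label[best], probs[best])
```

With PyTorch tensors the same result should be available through `torch.softmax(logits, dim=-1)`.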

<h3>Other models are available, including ones that handle Spanish</h3>

In [50]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the multilingual model and tokenizer for sentiment classification
tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")

# Input text (here just a sad-face emoticon rather than a Spanish sentence)
texto = ":c "

# Tokenize the text
inputs = tokenizer(texto, return_tensors="pt")

# Run inference without computing gradients
with torch.no_grad():
    logits = model(**inputs).logits

# Get the predicted class
predicted_class_id = logits.argmax().item()

# This list maps class ids 0 to 4 onto sentiment names, from very negative to very positive
etiquetas = ["very negative", "negative", "neutral", "positive", "very positive"]
print(f"The sentiment of the text is: {etiquetas[predicted_class_id]}")


The sentiment of the text is: neutral
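
The names in `etiquetas` are an interpretation layered on top of the model's own labels. To my knowledge, this checkpoint's model card describes its five classes as review ratings from 1 to 5 stars; `model.config.id2label` can confirm that. Assuming it holds, the sketch below shows one way to map class ids to stars and then to a coarse sentiment. Both helper functions and the thresholds are hypothetical choices, not part of the model.

```python
def class_id_to_stars(class_id):
    # Assumption: class 0 corresponds to 1 star and class 4 to 5 stars.
    return class_id + 1

def stars_to_sentiment(stars):
    # Hypothetical coarse grouping; the thresholds are a choice, not part of the model.
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

for class_id in range(5):
    stars = class_id_to_stars(class_id)
    print(class_id, stars, stars_to_sentiment(stars))
```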


<h2>You can also load a chosen architecture, for example a classifier, and train it on your own data</h2>

<h3>We will use the IMDb dataset</h3>

In [14]:
%%capture --no-display
pip install transformers datasets torch


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [15]:
%%capture --no-display
pip install tf-keras



In [16]:
%%capture --no-display
pip install transformers[torch]



In [17]:
%%capture --no-display
pip install 'accelerate>=0.26.0'



<h3>Import the modules</h3>

In [18]:
import torch
from datasets import load_dataset
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


<h3>Load the dataset</h3>

In [20]:
# Load the dataset
dataset = load_dataset("imdb")


In [21]:
dataset['train']['text'][99]

"This film is terrible. You don't really need to read this review further. If you are planning on watching it, suffice to say - don't (unless you are studying how not to make a good movie).<br /><br />The acting is horrendous... serious amateur hour. Throughout the movie I thought that it was interesting that they found someone who speaks and looks like Michael Madsen, only to find out that it is actually him! A new low even for him!!<br /><br />The plot is terrible. People who claim that it is original or good have probably never seen a decent movie before. Even by the standard of Hollywood action flicks, this is a terrible movie.<br /><br />Don't watch it!!! Go for a jog instead - at least you won't feel like killing yourself."

<h3>Create the tokenizer and apply it to the data</h3>

In [22]:
# Tokenization
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
    return tokenizer(examples['text'], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
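
With `padding="max_length"` and `truncation=True`, every review ends up as a sequence of the same length (512 tokens for `bert-base-uncased`), and the attention mask marks which positions hold real tokens. The sketch below imitates that behavior on plain lists with a tiny maximum length. `pad_or_truncate` is a simplified, hypothetical helper and not the tokenizer's actual implementation.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0   # special-token ids of bert-base-uncased

def pad_or_truncate(token_ids, max_length):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    body = token_ids[: max_length - 2]          # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)       # 1 = real token
    n_pad = max_length - len(input_ids)
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad  # 0 = padding

short_ids, short_mask = pad_or_truncate([2023, 2143], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1010)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Padding every example to the maximum length is simple but wasteful for short reviews; padding each batch to its longest member is a common alternative.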

In [23]:
# IMDb already ships with train and test splits; the test split is used as the validation set here
train_dataset = tokenized_datasets['train']
val_dataset = tokenized_datasets['test']

<h3>Initialize the model with the desired parameters</h3>

In [31]:
# Model configuration and training arguments
imdb_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

training_args = TrainingArguments(
    output_dir='./results',          # Directory where checkpoints and results are saved
    evaluation_strategy="epoch",     # Evaluate after every epoch (newer transformers releases rename this to eval_strategy)
    learning_rate=2e-5,              # Learning rate
    per_device_train_batch_size=8,   # Training batch size per device
    per_device_eval_batch_size=8,    # Evaluation batch size per device
    num_train_epochs=3,              # Number of epochs
    weight_decay=0.01,               # Weight decay
)

# Metric function (accuracy)
def compute_metrics(p):
    predictions, labels = p
    preds = predictions.argmax(axis=-1)
    return {"accuracy": accuracy_score(labels, preds)}


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
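
What `compute_metrics` calculates can be checked by hand on a few made-up rows: take the highest-scoring column of each row as the prediction, then count how often it matches the label. The sketch below does this in plain Python. The toy logits and labels are invented for illustration.

```python
def argmax(row):
    # Index of the largest value in a list.
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]]  # made-up scores
toy_labels = [0, 1, 0, 0]
print(accuracy_from_logits(toy_logits, toy_labels))  # 3 of 4 correct -> 0.75
```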


<h3>Train the model; this can take over an hour (about 76 minutes in the run below)</h3>

In [32]:

# Trainer
trainer = Trainer(
    model=imdb_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

# Train the model
trainer.train()


Epoch  Training Loss  Validation Loss  Accuracy
1      0.2625         0.258317         0.92088
2      0.1602         0.263804         0.93712
3      0.0838         0.32521          0.93904


TrainOutput(global_step=9375, training_loss=0.17759403198242188, metrics={'train_runtime': 4535.0153, 'train_samples_per_second': 16.538, 'train_steps_per_second': 2.067, 'total_flos': 1.9733329152e+16, 'train_loss': 0.17759403198242188, 'epoch': 3.0})
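
The numbers in `TrainOutput` can be reproduced from the training arguments. The sketch below assumes a single device and no gradient accumulation, and uses the 25,000 examples of the IMDb training split.

```python
import math

train_examples = 25_000      # size of the IMDb training split
batch_size = 8               # per_device_train_batch_size
epochs = 3                   # num_train_epochs
train_runtime = 4535.0153    # seconds, taken from the TrainOutput above

steps_per_epoch = math.ceil(train_examples / batch_size)
global_step = steps_per_epoch * epochs
samples_per_second = train_examples * epochs / train_runtime
steps_per_second = global_step / train_runtime

print(steps_per_epoch, global_step)                                # 3125 9375
print(round(samples_per_second, 3), round(steps_per_second, 3))    # 16.538 2.067
```

The results match `global_step`, `train_samples_per_second` and `train_steps_per_second` in the output, which is a quick way to confirm that the run covered the full dataset for all three epochs.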

<h3>Evaluate the results on the test data</h3>

In [33]:
results = trainer.evaluate()
print(results)

{'eval_loss': 0.32521024346351624, 'eval_accuracy': 0.93904, 'eval_runtime': 383.164, 'eval_samples_per_second': 65.246, 'eval_steps_per_second': 8.156, 'epoch': 3.0}


<h3>Make predictions</h3>

In [53]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Check whether a GPU is available: the model and its inputs must live on the same device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# Load the tokenizer and the model.
# Caution: this reloads the original 'bert-base-uncased' checkpoint, so the classification head is randomly
# initialized again (see the warning below) and the prediction printed here does not come from the fine-tuned
# weights. To predict with the fine-tuned model, skip this reload and reuse the imdb_model trained above.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
imdb_model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Move the model to the device
imdb_model.to(device)

# Text to classify
text = "I love this movie"

# Tokenize the text
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)

# Move the input tensors to the same device as the model
inputs = {key: value.to(device) for key, value in inputs.items()}

# Run the prediction
with torch.no_grad():  # Gradients are not needed for prediction
    outputs = imdb_model(**inputs)

# Get the predicted class
prediction = torch.argmax(outputs.logits, dim=-1)

# Print the prediction
print(f"Prediction: {prediction.item()}")  # 0 = negative, 1 = positive


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Prediction: 1


<h2>Language translation</h2>

In [35]:
%%capture --no-display
pip install sentencepiece


In [36]:
%%capture --no-display
pip install sacremoses



In [54]:
from transformers import MarianMTModel, MarianTokenizer

# Load the pretrained model and tokenizer for English -> Spanish translation
model_name = "Helsinki-NLP/opus-mt-en-es"  # English to Spanish
translate_model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# English text to translate
text = "i live in that house"

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt", padding=True)

# Generate the translation
translated = translate_model.generate(**inputs)

# Decode the generated token ids back into text
translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)

# Show the original and the translated text
print(f"Original text: {text}")
print(f"Translated text: {translated_text}")

Original text: i live in that house
Translated text: Vivo en esa casa.
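
The checkpoint name encodes the language pair as `opus-mt-{source}-{target}`, so other directions can often be tried by changing the two codes. The sketch below only builds such a name. `opus_mt_checkpoint` is a hypothetical helper, and it is an assumption that a given pair has actually been published.

```python
def opus_mt_checkpoint(source_lang, target_lang):
    # Assumption: the pair follows the "opus-mt-{src}-{tgt}" pattern and has
    # actually been published; this helper does not check the Hub.
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"

print(opus_mt_checkpoint("en", "es"))   # the checkpoint used in the cell above
print(opus_mt_checkpoint("es", "en"))   # the reverse direction, if available
```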


<h2>Text generation</h2>

In [55]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pretrained model and tokenizer
model_name = "gpt2"  # The base GPT-2 model
generative_model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Make sure the model is on the right device (CPU or GPU)
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
generative_model.to(device)

# Input text (prompt)
prompt = "Today its wendsday an tomorrow"

# Tokenize the prompt
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate text from the prompt
output = generative_model.generate(
    **inputs,
    max_length=100,          # Maximum total length in tokens, prompt included
    num_return_sequences=1,  # Number of sequences to generate
    no_repeat_ngram_size=2,  # No 2-gram may appear twice in the output
    temperature=0.7,         # Values below 1 make sampling less random
    top_k=50,                # Sample only from the 50 most likely tokens
    top_p=0.95,              # Nucleus sampling: keep the smallest token set whose cumulative probability reaches 0.95
    do_sample=True           # Sample instead of using greedy decoding
)

# Decode the generated output
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

# Show the generated text
print(generated_text)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Today its wendsday an tomorrow's a-day, and the way it's structured, it is the first time it has done so."

The president is expected to make the announcement on Monday, but no date has been set.
.@JPMorgan will be announcing this morning. — Robert Costa (@costareports) October 30, 2015
, "It's been two days since the Trump administration announced a new program to train and equip American military and intelligence personnel in the
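
The sampling parameters `temperature`, `top_k` and `top_p` act on the scores of the next token at each step. The sketch below applies them to a made-up five-token vocabulary in plain Python. `sample_next_token` is a simplified illustration written for this section and not the library's implementation.

```python
import math
import random

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.95, rng=random):
    """Pick one token id from raw scores using temperature, top-k and top-p."""
    # 1. Temperature: divide the scores; values below 1 sharpen the distribution.
    scaled = [score / temperature for score in logits]
    # 2. Top-k: keep only the k highest-scoring token ids.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # 3. Softmax over the survivors.
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # 4. Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, prob in zip(ranked, probs):
        kept.append((token_id, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # 5. Sample from the shortlist (choices() renormalizes the weights).
    ids, kept_probs = zip(*kept)
    return rng.choices(ids, weights=kept_probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]   # made-up scores for a 5-token vocabulary
rng = random.Random(0)
samples = [sample_next_token(toy_logits, top_k=3, rng=rng) for _ in range(1000)]
print({token_id: samples.count(token_id) for token_id in sorted(set(samples))})
```

With `top_k=3` the two lowest-scoring tokens can never be drawn, and with `top_k=1` the procedure reduces to greedy decoding.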


<h2>Text summarization</h2>

In [39]:
from transformers import BartForConditionalGeneration, BartTokenizer

# Load the pretrained model and tokenizer for summarization
model_name = "facebook/bart-large-cnn"  # BART fine-tuned for summarization
resum_model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)

# Text to summarize
text = """
Bioinformatics, as related to genetics and genomics, is a scientific subdiscipline that involves using computer technology to collect, store, analyze and disseminate biological data and information, such as DNA and amino acid sequences or annotations about those sequences.
"""

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)

# Generate the summary
summary_ids = resum_model.generate(
    inputs["input_ids"],
    max_length=150,      # Maximum summary length in tokens
    min_length=50,       # Minimum summary length in tokens
    length_penalty=2.0,  # Positive values favor longer summaries during beam search
    num_beams=4,         # Number of beams in the beam search
    early_stopping=True
)

# Decode the generated summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Show the summary
print(f"Generated summary: {summary}")


Generated summary: Bioinformatics is a subdiscipline of genetics and genomics. It involves using computer technology to collect, store, analyze and disseminate biological data and information. It can include DNA and amino acid sequences or annotations about those sequences.
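
The effect of `length_penalty` is easier to see with numbers. As I understand the Transformers beam search, a finished hypothesis is scored as its total log-probability divided by its length raised to `length_penalty`; the sketch below assumes that formula and compares two made-up hypotheses.

```python
def beam_score(sum_log_probs, length, length_penalty):
    # Assumption: score = total log-probability / length ** length_penalty.
    return sum_log_probs / (length ** length_penalty)

short = {"sum_log_probs": -4.0, "length": 5}     # made-up short hypothesis
longer = {"sum_log_probs": -6.0, "length": 12}   # made-up longer hypothesis

for penalty in (0.0, 1.0, 2.0):
    short_score = beam_score(short["sum_log_probs"], short["length"], penalty)
    longer_score = beam_score(longer["sum_log_probs"], longer["length"], penalty)
    winner = "short" if short_score > longer_score else "longer"
    print(penalty, round(short_score, 4), round(longer_score, 4), winner)
```

Log-probabilities are negative, so dividing by a larger power of the length moves the score of long hypotheses closer to zero; with a penalty of 0 the short hypothesis wins, while with 1.0 or 2.0 the longer one does.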


<h2>Question answering</h2>

In [40]:
from transformers import BertForQuestionAnswering, BertTokenizer
import torch

# Load the pretrained model and tokenizer for question answering
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
respond_model = BertForQuestionAnswering.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

# Context the model will search for the answer
context = """
Hugging Face is a company that provides an open-source platform for natural language processing (NLP) tasks. 
Their platform offers pre-trained models, datasets, and tools to make it easier for developers to build NLP applications.
They are well known for their transformer models like BERT, GPT-2, T5, and others, which have been widely adopted in the AI community.
"""

# Question to answer
question = "What is Hugging Face known for?"

# Tokenize the question and the context together as one pair
inputs = tokenizer(question, context, add_special_tokens=True, return_tensors="pt")

# Get the input tensors
input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]

# Run the model to score every token as a possible answer start or end.
# Note: the tokenizer's token_type_ids are not passed, so the model falls back to all zeros.
with torch.no_grad():  # Gradients are not needed for inference
    outputs = respond_model(input_ids=input_ids, attention_mask=attention_mask)

# Take the most likely start and end token positions
start_index = torch.argmax(outputs.start_logits)
end_index = torch.argmax(outputs.end_logits)

# Convert the answer token ids back to text
answer_ids = input_ids[0][start_index:end_index + 1]
answer = tokenizer.decode(answer_ids)

# Show the answer
print(f"Question: {question}")
print(f"Answer: {answer}")


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Question: What is Hugging Face known for?
Answer: their transformer models
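
Taking the argmax of the start and end scores independently works for this example, but nothing guarantees that the end position comes after the start. A common remedy is to search for the best valid pair. The sketch below does so on made-up scores; `best_span` and `max_answer_len` are hypothetical names, not part of the library.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start_logits[start] + end_logits[end]
    subject to start <= end < start + max_answer_len."""
    best, best_score = (0, 0), float("-inf")
    for start, start_score in enumerate(start_logits):
        last = min(len(end_logits), start + max_answer_len)
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best

# Made-up scores where the independent argmaxes are inconsistent.
start_logits = [0.1, 0.2, 3.0, 0.3, 2.5]
end_logits = [0.0, 4.0, 0.5, 3.5, 0.2]

naive = (start_logits.index(max(start_logits)), end_logits.index(max(end_logits)))
print(naive)                                  # (2, 1): the end precedes the start
print(best_span(start_logits, end_logits))    # (2, 3): best pair with start <= end
```

With the naive pair, the slice `input_ids[0][start:end + 1]` would be empty, so the decoded answer would be an empty string.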
