# Text generation example using Text2Text

Using the freely available Text2Text generation pipeline from [Hugging Face](https://huggingface.co/docs/transformers/model_doc/gpt2), we will answer questions, summarize text, and translate sentences. Note that most of these capabilities target English; adapting them to another language requires fine-tuning or even a language-specific model.

The first step is to import the required tools from the main library.

In [1]:
from transformers import pipeline

## Loading the model

In [2]:
text2text = pipeline("text2text-generation")

No model was supplied, defaulted to google-t5/t5-base and revision a9723ea (https://huggingface.co/google-t5/t5-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu
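
The log above recommends naming the model and revision explicitly instead of relying on the default. A minimal sketch of how that could look, assuming the same checkpoint the log reports (`google-t5/t5-base`, revision `a9723ea`); the helper name `load_text2text` is hypothetical and not part of transformers.

```python
# Model name and revision copied from the pipeline's log message.
MODEL_NAME = "google-t5/t5-base"
MODEL_REVISION = "a9723ea"


def load_text2text(model=MODEL_NAME, revision=MODEL_REVISION):
    """Build a text2text-generation pipeline with a pinned checkpoint.

    `load_text2text` is a hypothetical helper, not a transformers API.
    """
    # Imported lazily so that defining the helper does not load the library.
    from transformers import pipeline

    return pipeline("text2text-generation", model=model, revision=revision)


# Usage (downloads the model on the first call):
# text2text = load_text2text()
```

Pinning both values should make the notebook reproduce the same outputs even if the library's default model changes later.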


## Answering questions from context

The model reads a sentence and answers a given question based on it.

In [8]:
text2text("question: ¿A quién da flores María? context: María le da flores a John")

[{'generated_text': 'John'}]

## Text translation

In [21]:
text2text("translate English to French: Maria gives flowers to John")

[{'generated_text': 'Maria remet des fleurs à John'}]

## Text summarization

In [23]:
text2text("""summarize: Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language,
 in particular how to program computers to process and analyze large amounts of natural language data.""")

[{'generated_text': 'natural language processing (NLP) is a subfield of linguistics, computer science,'}]
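
The three calls above differ only in the task prefix placed in front of the input. A minimal sketch of helpers that build those prompts; the function names are hypothetical, and the list of translation targets is an assumption based on the language pairs T5 was originally trained on (English to German, French, and Romanian).

```python
# Assumption: target languages covered by T5's original translation tasks.
T5_TRANSLATION_TARGETS = ("German", "French", "Romanian")


def qa_prompt(question, context):
    """Format a question-answering input as used in the notebook."""
    return f"question: {question} context: {context}"


def translation_prompt(text, target="French"):
    """Format an English-to-`target` translation input."""
    if target not in T5_TRANSLATION_TARGETS:
        raise ValueError(f"unsupported target language: {target}")
    return f"translate English to {target}: {text}"


def summary_prompt(text):
    """Format a summarization input."""
    return f"summarize: {text}"


print(translation_prompt("Maria gives flowers to John"))
# translate English to French: Maria gives flowers to John
```

Any of these strings can be passed directly to the `text2text` pipeline created earlier.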

# Using specific Hugging Face transformer models

Let's see how the results differ when we load a specific model directly instead of relying on the pipeline default for the text2text tasks above. In this case we use the **T5** model (the `t5-small` checkpoint) to summarize text.

In [24]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

In [25]:
model_name = 't5-small'
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


## Input text

In [26]:
input_text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language,
 in particular how to program computers to process and analyze large amounts of natural language data."""

## Text preprocessing

In [27]:
preprocess_text = input_text.strip().replace("\n", "")
t5_input_text = f"summarize: {preprocess_text}"
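
Removing `"\n"` outright works for this input only because the second line starts with a space. When a line break is the only separator between two words, deleting it fuses them. A small sketch of a more defensive alternative; `normalize_whitespace` is a hypothetical helper name.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(text.split())


sample = "computers and human language,\nin particular how to program computers"

# Deleting the newline fuses the words on either side of it.
print(sample.strip().replace("\n", ""))
# Splitting and re-joining keeps the words separated.
print(normalize_whitespace(sample))
```

For the `input_text` used in this notebook both approaches yield the same string.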

## Converting the input text into tokens for the model

In [28]:
tokenized_text = tokenizer.encode(t5_input_text, return_tensors="pt")

## Generating the summary

In [29]:
summary_ids = model.generate(tokenized_text, num_beams=4, no_repeat_ngram_size=2, min_length=30, max_length=100, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:", summary)

Summary: natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence. it focuses on how to program computers to process and analyze large amounts of natural languages data.
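
The preprocessing, tokenization, and generation steps can be gathered into one reusable function. This is only a sketch: `summarize` is a hypothetical helper, it expects an already loaded model and tokenizer, and its defaults are the generation settings used in the cell above.

```python
def summarize(text, model, tokenizer, **generate_kwargs):
    """Summarize `text` with a T5-style model and its tokenizer.

    `summarize` is a hypothetical helper; keyword arguments override the
    default generation settings taken from the notebook.
    """
    settings = dict(num_beams=4, no_repeat_ngram_size=2,
                    min_length=30, max_length=100, early_stopping=True)
    settings.update(generate_kwargs)

    # Collapse line breaks and extra spaces, then add the T5 task prefix.
    prompt = "summarize: " + " ".join(text.split())
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    summary_ids = model.generate(input_ids, **settings)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)


# Usage with the objects loaded earlier in this notebook:
# print("Summary:", summarize(input_text, model, tokenizer))
```

Passing the model and tokenizer as arguments keeps the function independent of any particular checkpoint, so the same code could be tried with `t5-small` or a larger T5 variant.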
