## **Transformers**

In [21]:
!pip install transformers==4.9.1



In [37]:
!pip install sentencepiece

Successfully installed sentencepiece-0.1.97


## **GPT-2 for text generation**

Load the model:

In [31]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')

Example 1:

In [32]:
set_seed(123)
generator(
    "Hey readers, today is",
    max_length=20,
    num_return_sequences=4,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hey readers, today is not your only chance to learn more about what's happening around the Web and"},
 {'generated_text': 'Hey readers, today is Christmas. You will be able to access a brand new edition of the New'},
 {'generated_text': "Hey readers, today is C++14 Day! It's great. We have a couple of changes"},
 {'generated_text': "Hey readers, today is the day! If you're not ready, a lot of you on the"}]
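The pipeline returns a list of dictionaries, one per generated sequence, each with a `generated_text` key that includes the prompt. A minimal sketch of unpacking that result follows. It reuses two of the outputs printed above as literal data so that it runs without downloading the model; the names `results` and `continuations` are illustrative.

```python
# Two of the results printed above, copied as literal data.
results = [
    {"generated_text": "Hey readers, today is Christmas. You will be able to access a brand new edition of the New"},
    {"generated_text": "Hey readers, today is the day! If you're not ready, a lot of you on the"},
]
prompt = "Hey readers, today is"

# Drop the prompt so that only the text added by GPT-2 remains.
continuations = [r["generated_text"][len(prompt):].strip() for r in results]

for i, text in enumerate(continuations, start=1):
    print(f"{i}. {text}")
```

With the live pipeline, `results` would simply be the return value of the `generator(...)` call.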

In [33]:
generator(
    "Hey enemies, today is not",
    max_length=20,
    num_return_sequences=4,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Hey enemies, today is not an opportunity for many people to celebrate the success of the first year of'},
 {'generated_text': "Hey enemies, today is not easy!\n\nWhen you're with the boss at your team's"},
 {'generated_text': 'Hey enemies, today is not a chance to forget..." "But what if I told you I do'},
 {'generated_text': 'Hey enemies, today is not a nice day. Let it be clear."\n\nAs it turns'}]

The following code shows how a sentence is tokenized, that is, encoded into the input format that the GPT-2 model expects.

In [None]:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
text = "Let us encode this sentence please, us"
encoded_input = tokenizer(text, return_tensors='pt')
encoded_input

{'input_ids': tensor([[ 5756,   514, 37773,   428,  6827,  3387,    11,   514]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}
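In the output above, `input_ids` holds one integer per token and `attention_mask` marks which positions are real tokens (all ones here, since nothing is padded). The toy sketch below mimics that structure in plain Python. Its vocabulary is an assumption: the IDs are copied from the output above and paired with words by position. It also splits on words and commas, which is not GPT-2's actual byte-pair encoding.

```python
import re

# Toy vocabulary: IDs copied from the output above and paired with words by
# position. The pairing is inferred, and real GPT-2 tokens also encode the
# leading space, which this sketch ignores.
toy_vocab = {
    "Let": 5756, "us": 514, "encode": 37773, "this": 428,
    "sentence": 6827, "please": 3387, ",": 11,
}

def toy_encode(text):
    # Split into words and commas (NOT byte-pair encoding).
    tokens = re.findall(r"\w+|,", text)
    input_ids = [toy_vocab[t] for t in tokens]
    # No padding was added, so every position is a real token.
    attention_mask = [1] * len(input_ids)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

encoded = toy_encode("Let us encode this sentence please, us")
print(encoded)
```

Note that the repeated word "us" maps to the same ID (514) both times, which matches the first and last-but-one positions of the real output.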

In [None]:
from transformers import GPT2Model
model = GPT2Model.from_pretrained('gpt2')
output = model(**encoded_input)
output['last_hidden_state'].shape

torch.Size([1, 8, 768])

## **Question Answering**

In [34]:
qa_pipeline = pipeline("question-answering")


In [27]:
context = """Machine learning (ML) is the study of computer algorithms that improve automatically through experience. 
It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as "training data", 
in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks."""

In [35]:
question = "What are machine learning models based on?"
result = qa_pipeline(question=question, context=context)
print("Answer:", result['answer'])

Answer: sample data
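The question-answering pipeline is extractive: the answer is a span copied from the context, and the result dictionary carries `score`, `start`, and `end` fields alongside `answer`. The sketch below illustrates only that span relationship. It uses `str.find` in place of the model (an assumption for illustration; the real pipeline predicts the offsets) and a shortened copy of the context.

```python
# Shortened copy of the context used above.
context = (
    "Machine learning algorithms build a model based on sample data, "
    'known as "training data".'
)
answer = "sample data"  # the answer printed above

# Locate the answer; the pipeline's `start` and `end` are character
# offsets into the context of this kind.
start = context.find(answer)
end = start + len(answer)

print(start, end)
print(context[start:end])
```

With the live pipeline, `context[result['start']:result['end']]` should give the same text as `result['answer']`.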


## **Translation**

In [3]:
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

In [2]:
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")


In [12]:
en_text = "hello my friend"
tokenizer.src_lang = "en"  # language of the input text
encoded_en = tokenizer(en_text, return_tensors="pt")

# forced_bos_token_id selects the target language of the generated text
generated_tokens = model.generate(**encoded_en, forced_bos_token_id=tokenizer.get_lang_id("es"))
print("Spanish: ", tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))

generated_tokens2 = model.generate(**encoded_en, forced_bos_token_id=tokenizer.get_lang_id("fr"))
print("French: ", tokenizer.batch_decode(generated_tokens2, skip_special_tokens=True))

generated_tokens3 = model.generate(**encoded_en, forced_bos_token_id=tokenizer.get_lang_id("ru"))
print("Russian: ", tokenizer.batch_decode(generated_tokens3, skip_special_tokens=True))

Spanish:  ['Hola mi amigo']
French:  ['Bonjour mon ami']
Russian:  ['Здравствуйте мой друг']
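The three near-identical calls above can be folded into a loop over target language codes. The sketch below assumes the `model` and `tokenizer` objects loaded earlier are passed in as arguments; `translate_to` is a hypothetical helper name, not part of the transformers API.

```python
def translate_to(text, src_lang, target_langs, model, tokenizer):
    """Translate `text` into each language code listed in `target_langs`."""
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")

    translations = {}
    for lang in target_langs:
        # Forcing the first generated token to the target-language ID
        # selects the output language.
        tokens = model.generate(
            **encoded,
            forced_bos_token_id=tokenizer.get_lang_id(lang),
        )
        # batch_decode returns one string per input; a single text was passed.
        translations[lang] = tokenizer.batch_decode(tokens, skip_special_tokens=True)[0]
    return translations
```

Called as `translate_to("hello my friend", "en", ["es", "fr", "ru"], model, tokenizer)`, it would return a dictionary keyed by language code.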
