<a href="https://colab.research.google.com/github/StarSovu/AnimalCrossing/blob/master/text_generation_pipeline_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 4 exercise

Based on "a brief example of how to run text generation with a causal language model and `pipeline`".

The [transformers](https://huggingface.co/docs/transformers/index) Python package should be installed. It will be used to load the model and tokenizer and to run generation.

In [2]:
!pip install --quiet transformers

Import the `AutoTokenizer` and `AutoModelForCausalLM` classes and the `pipeline` function. The first two support loading tokenizers and generative models from the [Hugging Face repository](https://huggingface.co/models), and the last wraps a tokenizer and a model for convenience.

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

In [4]:
MODEL_NAME = 'gpt2-large'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

Instantiate a text generation pipeline using the tokenizer and model.

In [5]:
pipe = pipeline(
    'text-generation',
    model=model,
    tokenizer=tokenizer,
    device=model.device
)

We can now call the pipeline with a text prompt; it will take care of tokenizing, encoding, generation, and decoding:

In [6]:
output = pipe('Identify the capital cities of countries. Question: What is the capital of Finland? Answer:', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Identify the capital cities of countries. Question: What is the capital of Finland? Answer: Tall benefit from low precipitation. Question: What is the capitalcalculation of Estonia? Answer: Estonia has an average precipitation of


Zero-shot prompting sometimes produced the right answer for this prompt, but in other runs (such as the one shown above) the answer was completely wrong.
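The one-shot and two-shot prompts in the following cells are written out by hand. As a minimal sketch, they could also be assembled from a list of solved examples. `build_capital_prompt` is a hypothetical helper, not part of `transformers`, and it only reproduces the `Question: ... Answer: ...` layout used in this notebook.

```python
INSTRUCTION = 'Identify the capital cities of countries.'

def build_capital_prompt(examples, query_country):
    """Build a few-shot prompt in the notebook's Question/Answer layout.

    examples: list of (country, capital) pairs used as demonstrations.
    query_country: the country whose capital the model should fill in.
    """
    parts = [INSTRUCTION]
    for country, capital in examples:
        parts.append(f'Question: What is the capital of {country}? Answer: {capital}')
    # The prompt ends with an open "Answer:" for the model to complete.
    parts.append(f'Question: What is the capital of {query_country}? Answer:')
    return ' '.join(parts)

zero_shot = build_capital_prompt([], 'Finland')
one_shot = build_capital_prompt([('Sweden', 'Stockholm')], 'Finland')
two_shot = build_capital_prompt(
    [('Sweden', 'Stockholm'), ('Denmark', 'Copenhagen')], 'Finland')
print(two_shot)
```

Each of these strings can be passed to `pipe(...)` in place of the literal prompt.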

In [8]:
output = pipe('Identify the capital cities of countries. Question: What is the capital of Sweden? Answer: Stockholm Question: What is the capital of Finland? Answer:', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Identify the capital cities of countries. Question: What is the capital of Sweden? Answer: Stockholm Question: What is the capital of Finland? Answer: Helsinki Question: What is the capital of Estonia? Answer: Tallinn Question: What is the capital of Latvia? Answer:


In [9]:
output = pipe('Identify the capital cities of countries. Question: What is the capital of Sweden? Answer: Stockholm Question: What is the capital of Denmark? Answer: Copenhagen Question: What is the capital of Finland? Answer:', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Identify the capital cities of countries. Question: What is the capital of Sweden? Answer: Stockholm Question: What is the capital of Denmark? Answer: Copenhagen Question: What is the capital of Finland? Answer: Helsinki Question: What is the capital of the Netherlands? Answer: The Hague Answer: Amsterdam Question: What is the capital of


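The two-shot prompt got Finland right, but the model then gave a wrong answer (The Hague) to its own follow-up question about the Netherlands before adding Amsterdam.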
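In all three runs the model keeps generating new questions after its answer, so only the text right after the prompt is the actual prediction. A minimal sketch of cutting the continuation at the next `Question:` marker follows. `first_answer` is a hypothetical helper, and the sample strings are copied from the one-shot output above.

```python
def first_answer(generated_text, prompt, stop='Question:'):
    """Return the text after the prompt, up to the next stop marker."""
    # The pipeline output shown above starts with the prompt, so strip it first.
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    # Keep only what comes before the model starts inventing a new question.
    return continuation.split(stop, 1)[0].strip()

prompt = ('Identify the capital cities of countries. '
          'Question: What is the capital of Sweden? Answer: Stockholm '
          'Question: What is the capital of Finland? Answer:')
generated = (prompt + ' Helsinki Question: What is the capital of Estonia? '
             'Answer: Tallinn Question: What is the capital of Latvia? Answer:')
print(first_answer(generated, prompt))
```

The text-generation pipeline also accepts `return_full_text=False`, which returns only the continuation and makes the prefix stripping unnecessary.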

# Binary sentiment classification

In [10]:
output = pipe('Do the following texts express a positive or negative sentiment? Text: I do not like you. Answer:', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Do the following texts express a positive or negative sentiment? Text: I do not like you. Answer: I do not like you.

Text: I agree to disagree. Answers: I agree to disagree.

Text


In [11]:
output = pipe('Do the following texts express a positive or negative sentiment? Text: I do not like you. Answer: Negative. Text: I really like you. Answer:', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Do the following texts express a positive or negative sentiment? Text: I do not like you. Answer: Negative. Text: I really like you. Answer: Positive. Text: I really like you. Answer: Negative. Text: I really don't like you. Answer: Negative


In [12]:
output = pipe('Do the following texts express a positive or negative sentiment? Text: I do not like you. Answer: Negative. Text: I really like you. Answer: Positive. Text: I hate you. Answer:', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Do the following texts express a positive or negative sentiment? Text: I do not like you. Answer: Negative. Text: I really like you. Answer: Positive. Text: I hate you. Answer: Positive. Text: I like you. Answer: Positive. Text: I don't like you. Answer: Negative.



In [14]:
output = pipe('Do the following texts express a positive or negative sentiment? Text: I dislike cats. Answer:', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Do the following texts express a positive or negative sentiment? Text: I dislike cats. Answer: I dislike cats. Text: I enjoy being on my own. Answer: I enjoy being on my own. Text: I


In [15]:
output = pipe('Do the following texts express a positive or negative sentiment? Text: I dislike cats. Answer: Negative. Text: I love to run. Answer:', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Do the following texts express a positive or negative sentiment? Text: I dislike cats. Answer: Negative. Text: I love to run. Answer: Positive. Text: Do you like to have a friend over? Answer: Positive. Text: I am not sure about this


In [16]:
output = pipe('Do the following texts express a positive or negative sentiment? Text: I dislike cats. Answer: Negative. Text: I love to run. Answer: Positive. Text: I am fond of math. Answer:', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Do the following texts express a positive or negative sentiment? Text: I dislike cats. Answer: Negative. Text: I love to run. Answer: Positive. Text: I am fond of math. Answer: Positive. Text: I love to listen to bands. Answer: Positive. Text: I find sports interesting. Answer: Positive


Zero-shot prompting did not classify the text at all: the model just repeated the input. With one or two examples it did answer "Negative" or "Positive", but not always correctly (for example, "I hate you." was labeled Positive).
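To make this comparison less impressionistic, the first label in each continuation can be extracted and compared with the expected one. The sketch below does that for three of the outputs above. `parse_label` is a hypothetical helper, and the expected labels are an assumption based on reading the input texts.

```python
import re

def parse_label(continuation):
    """Return 'Positive' or 'Negative' if the continuation starts with one, else None."""
    match = re.match(r'\s*(Positive|Negative)\b', continuation)
    return match.group(1) if match else None

# (start of the continuation copied from the outputs above, expected label)
cases = [
    # zero shot, input "I do not like you."
    (' I do not like you.', 'Negative'),
    # one shot, input "I really like you."
    (' Positive. Text: I really like you. Answer: Negative.', 'Positive'),
    # two shot, input "I hate you."
    (' Positive. Text: I like you. Answer: Positive.', 'Negative'),
]

predictions = [parse_label(text) for text, _ in cases]
correct = sum(pred == gold for pred, (_, gold) in zip(predictions, cases))
print(predictions, f'{correct}/{len(cases)} correct')
```

Only the first label counts here; anything the model generates after it is its own invented continuation.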

# Person name recognition

In [20]:
output = pipe('List the person names occurring in the following texts. Text: A little bit of Monica in my life A little bit of Erica by my side A little bit of Rita is all I need A little bit of Tina is what I see A little bit of Sandra in the sun A little bit of Mary all night long A little bit of Jessica, here I am A little bit of you makes me your man (ah). Answer:', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


List the person names occurring in the following texts. Text: A little bit of Monica in my life A little bit of Erica by my side A little bit of Rita is all I need A little bit of Tina is what I see A little bit of Sandra in the sun A little bit of Mary all night long A little bit of Jessica, here I am A little bit of you makes me your man (ah). Answer: A beautiful woman named Sue.

The "Sue", of course, is Sue Marsteller and not so long


In [21]:
output = pipe('List the person names occurring in the following texts. Text: A little bit of Monica in my life A little bit of Erica by my side A little bit of Rita is all I need A little bit of Tina is what I see A little bit of Sandra in the sun A little bit of Mary all night long A little bit of Jessica, here I am A little bit of you makes me your man (ah). Answer: Monica', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


List the person names occurring in the following texts. Text: A little bit of Monica in my life A little bit of Erica by my side A little bit of Rita is all I need A little bit of Tina is what I see A little bit of Sandra in the sun A little bit of Mary all night long A little bit of Jessica, here I am A little bit of you makes me your man (ah). Answer: Monica

Lest you forget, Monica is from the original film version of the song (which I love but are somewhat inconsistent


In [22]:
output = pipe('List the person names occurring in the following texts. Text: A little bit of Monica in my life A little bit of Erica by my side A little bit of Rita is all I need A little bit of Tina is what I see A little bit of Sandra in the sun A little bit of Mary all night long A little bit of Jessica, here I am A little bit of you makes me your man (ah). Answer: Monica, Erica', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


List the person names occurring in the following texts. Text: A little bit of Monica in my life A little bit of Erica by my side A little bit of Rita is all I need A little bit of Tina is what I see A little bit of Sandra in the sun A little bit of Mary all night long A little bit of Jessica, here I am A little bit of you makes me your man (ah). Answer: Monica, Erica and Rita are good ones. A little of Jessica is all I need. A little of Maria is what I see. Answer


The zero-shot prompt and the prompt with one name supplied ("Answer: Monica") did not produce a list of names at all. With two names supplied, the model added one more name (Rita) but did not continue the list.
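One way to quantify how much of the list was recovered is to check which of the names in the lyric appear in the answer. This is a rough sketch: `names_found` is a hypothetical helper, the `GOLD_NAMES` list is read off the prompt text, and the answer string is the last output above, including the `Monica, Erica` prefix that was supplied in the prompt.

```python
import re

# Names that occur in the lyric used in the prompt.
GOLD_NAMES = ['Monica', 'Erica', 'Rita', 'Tina', 'Sandra', 'Mary', 'Jessica']

def names_found(answer, gold_names):
    """Return the gold names that occur as whole words in the answer text."""
    return [name for name in gold_names
            if re.search(rf'\b{re.escape(name)}\b', answer)]

answer = ('Monica, Erica and Rita are good ones. '
          'A little of Jessica is all I need. '
          'A little of Maria is what I see. Answer')
found = names_found(answer, GOLD_NAMES)
missing = [name for name in GOLD_NAMES if name not in found]
print('found:', found)
print('missing:', missing)
```

Whole-word matching keeps "Maria" in the continuation from being counted as "Mary".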

In [23]:
output = pipe('List the person names occurring in the following texts. Text: Mary had a little lamb, Answer:', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


List the person names occurring in the following texts. Text: Mary had a little lamb, Answer: Mary was not a virgin. Source: Matthew 2:1-3, 13, 27-28; Mark 6:2


In [24]:
output = pipe('List the person names occurring in the following texts. Text: Mary had a little lamb, Answer: Mary, Text: Frank had a little cow, Answer:', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


List the person names occurring in the following texts. Text: Mary had a little lamb, Answer: Mary, Text: Frank had a little cow, Answer: Frank, Text: Jacob had a little goat, Answer: Jacob, Text: Joseph also had a lamb, Answer: Joseph


In [26]:
output = pipe('List the person names occurring in the following texts. Text: Mary had a little lamb, Answer: Mary, Text: Frank had a little cow, Answer: Frank, Text: Then there was Harry, Answer:', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


List the person names occurring in the following texts. Text: Mary had a little lamb, Answer: Mary, Text: Frank had a little cow, Answer: Frank, Text: Then there was Harry, Answer: Harry, Text: And there was Ginny, Answer: Ginny, Text: A few days after that, they found Harry on


This task may be a better example of true one-shot and two-shot prompting. With one or two examples, the model detected the names accurately, even in the two-shot case, where the name in the query was not the first word of the text as it was in both examples.

# Two-digit addition

In [None]:
output = pipe('This is first grade math exam. 12 + 12 =', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


This is first grade math exam. 12 + 12 = 14. Math-test is really close to math-test, Communists should have been forced to change their exam.




In [None]:
output = pipe('This is first grade math exam. 12 + 12 = 24, 13 + 12 =', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


This is first grade math exam. 12 + 12 = 24, 13 + 12 = 28, 4 + 12 = 9, 13 + 4 = 17, 1 + 12 = 3, 15 + 4 = 12


In [None]:
output = pipe('This is first grade math exam. 12 + 12 = 24, 13 + 12 = 25, 11 + 10 =', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


This is first grade math exam. 12 + 12 = 24, 13 + 12 = 25, 11 + 10 = 26, 10 + 8 = 27, 9 + 8 = 28, 10 + 8 = 30.


Example:




That didn't go well at all: none of the three target sums was answered correctly. Let's try it without the introductory sentence.

In [27]:
output = pipe('44 + 12 =', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


44 + 12 = 12.0

= 36 = 15.0

= 4 = 6.0

= 8 = 9


In [28]:
output = pipe('44 + 12 = 56, 25 + 26 =', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


44 + 12 = 56, 25 + 26 = 60, 29 + 30 = 65) So the total gain of your current character has increased from -26 to +27 points


In [29]:
output = pipe('44 + 12 = 56, 25 + 26 = 51, 42 + 39 =', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


44 + 12 = 56, 25 + 26 = 51, 42 + 39 = 41, 42 + 47 = 38, 42 + 54 = 37, 43 + 37 = 32, 43 + 33 = 29


Without the introductory sentence, every sum the model generated was incorrect.
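The sums can be checked mechanically rather than by eye. The sketch below parses every `a + b = c` pattern from a generated string and recomputes the sum. `check_sums` is a hypothetical helper, and the input is the one-shot output from the first-grade prompt above, so the first equation comes from the prompt itself.

```python
import re

def check_sums(text):
    """Find every 'a + b = c' in text and return (a, b, c, is_correct) tuples."""
    pattern = re.compile(r'(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)')
    results = []
    for a, b, c in pattern.findall(text):
        a, b, c = int(a), int(b), int(c)
        results.append((a, b, c, a + b == c))
    return results

generated = ('This is first grade math exam. 12 + 12 = 24, 13 + 12 = 28, '
             '4 + 12 = 9, 13 + 4 = 17, 1 + 12 = 3, 15 + 4 = 12')
results = check_sums(generated)
for a, b, c, ok in results:
    print(f'{a} + {b} = {c}', 'correct' if ok else f'wrong, expected {a + b}')
```

The same function can be applied to `output[0]['generated_text']` from any of the addition cells.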

# Own case

In [None]:
output = pipe('Which country won the Eurovision Song Contest in the following years? 2006:', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Which country won the Eurovision Song Contest in the following years? 2006: Russia 2011: Portugal


In [None]:
output = pipe('Which country won the Eurovision Song Contest in the following years? 2005: Greece, 2006:', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Which country won the Eurovision Song Contest in the following years? 2005: Greece, 2006: Iceland, 2007: Spain, 2008: Portugal, 2009: Italy, 2010: Greece, 2011: Russia and 2012: Greece


In [None]:
output = pipe('Which country won the Eurovision Song Contest in the following years? 2004: Ukraine, 2005: Greece, 2006:', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Which country won the Eurovision Song Contest in the following years? 2004: Ukraine, 2005: Greece, 2006: Finland, 2007: Canada, 2008: Russia, 2009: Spain, 2010: Portugal.

Ukrainian




The model picks up the formatting, though the countries themselves are wrong in most cases. It did get Finland for 2006 right in the two-shot example.
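The generated winners can be checked against the actual ones. The sketch below parses `year: country` pairs from the two-shot output above and compares them with a small lookup table. `WINNERS` is filled in from general knowledge of the contest results rather than from this notebook, and `check_winners` is a hypothetical helper that only handles single-word country names.

```python
import re

# Eurovision Song Contest winners for the years that appear in the output.
WINNERS = {
    2004: 'Ukraine', 2005: 'Greece', 2006: 'Finland', 2007: 'Serbia',
    2008: 'Russia', 2009: 'Norway', 2010: 'Germany',
}

def check_winners(text):
    """Return (year, predicted_country, is_correct) for each 'YYYY: Country' pair."""
    pairs = re.findall(r'(\d{4}): ([A-Z][a-z]+)', text)
    return [(int(year), country, WINNERS.get(int(year)) == country)
            for year, country in pairs]

# The 2004 and 2005 entries were given in the prompt; the rest is generated.
generated = ('2004: Ukraine, 2005: Greece, 2006: Finland, 2007: Canada, '
             '2008: Russia, 2009: Spain, 2010: Portugal.')
checked = check_winners(generated)
for year, country, ok in checked:
    print(year, country, 'correct' if ok else 'wrong')
```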