<a href="https://colab.research.google.com/github/StarSovu/AnimalCrossing/blob/master/text_generation_pipeline_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 4 exercise

Based on "a brief example of how to run text generation with a causal language model and `pipeline`".

The [transformers](https://huggingface.co/docs/transformers/index) Python package should be installed. It will be used to load the model and tokenizer and to run generation.

In [3]:
!pip install --quiet transformers

Import the `AutoTokenizer`, `AutoModelForCausalLM`, and `pipeline` classes. The first two support loading tokenizers and generative models from the [Hugging Face repository](https://huggingface.co/models), and the last wraps a tokenizer and a model for convenience.

In [4]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

Load a generative model and its tokenizer. You can substitute any other generative model name here (e.g. one of the [TurkuNLP GPT-3 models](https://huggingface.co/models?sort=downloads&search=turkunlp%2Fgpt3)), but note that Colab may have issues running larger models.

In [5]:
MODEL_NAME = 'gpt2-large'

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

Instantiate a text generation pipeline using the tokenizer and model.

In [6]:
pipe = pipeline(
    'text-generation',
    model=model,
    tokenizer=tokenizer,
    device=model.device
)

We can now call the pipeline with a text prompt; it will take care of tokenizing, encoding, generation, and decoding:

In [7]:
output = pipe('Identify the capital cities of countries. Question: What is the capital of Finland? Answer:', max_new_tokens=25)

print(output)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Identify the capital cities of countries. Question: What is the capital of Finland? Answer: Helsinki.\n\nQuestion: What is the capital (or capitals) of France? Answer: Paris (Côte d'}]


In [8]:
print(output[0]['generated_text'])

Identify the capital cities of countries. Question: What is the capital of Finland? Answer: Helsinki.

Question: What is the capital (or capitals) of France? Answer: Paris (Côte d


Zero shot worked quite well in this case.

In [9]:
output = pipe('Identify the capital cities of countries. Question: What is the capital of Sweden? Answer: Stockholm Question: What is the capital of Finland? Answer:', max_new_tokens=25)

print(output)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Identify the capital cities of countries. Question: What is the capital of Sweden? Answer: Stockholm Question: What is the capital of Finland? Answer: Helsinki Question: What is the capital of Malta? Answer: Malta Question: What is the capital of the Kingdom of the Netherlands'}]


In [10]:
output = pipe('Identify the capital cities of countries. Question: What is the capital of Sweden? Answer: Stockholm Question: What is the capital of Denmark? Answer: Copenhagen Question: What is the capital of Finland? Answer:', max_new_tokens=25)

print(output)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Identify the capital cities of countries. Question: What is the capital of Sweden? Answer: Stockholm Question: What is the capital of Denmark? Answer: Copenhagen Question: What is the capital of Finland? Answer: Helsinki Question: What is the capital of Norway? Answer: Oslo Question: What is the capital of Switzerland? Answer: Bern'}]


The two-shot prompt gave a wrong answer to one of its own follow-up questions on an earlier try, but this time it worked nicely.
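The zero-, one- and two-shot prompts in the cells above share one pattern: an instruction, some solved question/answer pairs, and one open question. A minimal sketch of a helper that assembles such prompts could look like this; `build_capital_prompt` and `INSTRUCTION` are hypothetical names that are not part of the original notebook, and the resulting string would be passed to `pipe` exactly as before.

```python
# Hypothetical helper (not in the original notebook) for building the
# capital-city prompts with any number of solved examples ("shots").
INSTRUCTION = 'Identify the capital cities of countries.'


def build_capital_prompt(shots, query_country, instruction=INSTRUCTION):
    """Return a prompt string; `shots` is a list of (country, capital) pairs."""
    parts = [instruction]
    for country, capital in shots:
        # Each shot is a fully answered question.
        parts.append(f'Question: What is the capital of {country}? Answer: {capital}')
    # The final question is left open for the model to complete.
    parts.append(f'Question: What is the capital of {query_country}? Answer:')
    return ' '.join(parts)


zero_shot = build_capital_prompt([], 'Finland')
two_shot = build_capital_prompt(
    [('Sweden', 'Stockholm'), ('Denmark', 'Copenhagen')], 'Finland'
)
print(two_shot)
```

With an empty list of shots the helper reproduces the zero-shot prompt, and with the Sweden and Denmark pairs it reproduces the two-shot prompt used above.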

# Binary sentiment classification

In [14]:
output = pipe('Do the following texts express a positive or negative sentiment? Text: I do not like you. Answer:', max_new_tokens=25)

print(output)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Do the following texts express a positive or negative sentiment? Text: I do not like you. Answer: I like you. Text: Do you like me? Answer: No. Answer: Do you like me? Text: I'}]


In [12]:
output = pipe('Do the following texts express a positive or negative sentiment? Text: I do not like you. Answer: Negative. Text: I really like you. Answer: Positive. Text: I hate you. Answer:', max_new_tokens=25)

print(output)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Do the following texts express a positive or negative sentiment? Text: I do not like you. Answer: Negative. Text: I really like you. Answer: Positive. Text: I hate you. Answer: Neutral Text: I really hate you. Answer: Neutral Text: I really like you. Answer: Positive. Text: I'}]


In [13]:
output = pipe('Do the following texts express a positive or negative sentiment? Text: I do not like you. Answer: Negative. Text: I really like you. Answer: Positive. Text: I hate you. Answer: Negative. Text: I dislike cats. Answer: Negative. Text: I love to run. Answer: Positive. Text: I am fond of math. Answer:', max_new_tokens=25)

print(output)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Do the following texts express a positive or negative sentiment? Text: I do not like you. Answer: Negative. Text: I really like you. Answer: Positive. Text: I hate you. Answer: Negative. Text: I dislike cats. Answer: Negative. Text: I love to run. Answer: Positive. Text: I am fond of math. Answer: Positive. Text: I don't care. Answer: Negative."}]


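Zero-shot prompting did not classify the text at all. My next trials were two-shot and five-shot. In those cases the model attempted a classification, but in my opinion not very well: the two-shot prompt labelled "I hate you." as Neutral, although the five-shot prompt did answer Positive for "I am fond of math."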
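Because the model keeps generating new "Text: ... Answer: ..." pairs after the label, only the first part of the continuation is the actual classification. The following is a rough sketch of how that first label could be cut out of the generated text; `first_answer` is a hypothetical helper, not something the notebook or `transformers` provides, and it assumes the generated text starts with the prompt, as in the outputs above.

```python
# Hypothetical post-processing helper (not in the original notebook).
def first_answer(prompt, generated_text, stop='Text:'):
    """Return the model's first answer, without the prompt or later pairs."""
    # The pipeline output repeats the prompt, so drop it when present.
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    # Keep only what comes before the next 'Text:' marker made up by the model.
    answer = continuation.split(stop, 1)[0]
    return answer.strip().rstrip('.')


two_shot_prompt = (
    'Do the following texts express a positive or negative sentiment? '
    'Text: I do not like you. Answer: Negative. '
    'Text: I really like you. Answer: Positive. '
    'Text: I hate you. Answer:'
)
# Continuation copied from the two-shot output shown above.
generated = two_shot_prompt + (
    ' Neutral Text: I really hate you. Answer: Neutral'
    ' Text: I really like you. Answer: Positive. Text: I'
)
label = first_answer(two_shot_prompt, generated)
print(label)
```

For the two-shot output this yields `Neutral`, which makes the misclassification easy to spot without reading the whole continuation.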

# Person name recognition

In [15]:
output = pipe('List the person names occurring in the following texts. Text: Ladies and gentlemen, this is Mambo Number Five One, two, three, four, five Everybody in the car, so come on, let us ride To the liquor store around the corner The boys say they want some gin and juice But I really do not wanna Beer-bust like I had last week I must stay deep because talk is cheap I like Angela, Pamela, Sandra and Rita And as I continue, you know they getting sweeter (uh) So what can I do? I really beg you, my Lord To me is flirting is just like a sport Anything fly, it is all good, let me dump it Please set in the trumpet A little bit of Monica in my life A little bit of Erica by my side A little bit of Rita is all I need A little bit of Tina is what I see A little bit of Sandra in the sun A little bit of Mary all night long A little bit of Jessica, here I am A little bit of you makes me your man (ah). Answer:', max_new_tokens=25)

print(output)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "List the person names occurring in the following texts. Text: Ladies and gentlemen, this is Mambo Number Five One, two, three, four, five Everybody in the car, so come on, let us ride To the liquor store around the corner The boys say they want some gin and juice But I really do not wanna Beer-bust like I had last week I must stay deep because talk is cheap I like Angela, Pamela, Sandra and Rita And as I continue, you know they getting sweeter (uh) So what can I do? I really beg you, my Lord To me is flirting is just like a sport Anything fly, it is all good, let me dump it Please set in the trumpet A little bit of Monica in my life A little bit of Erica by my side A little bit of Rita is all I need A little bit of Tina is what I see A little bit of Sandra in the sun A little bit of Mary all night long A little bit of Jessica, here I am A little bit of you makes me your man (ah). Answer: You are the answer. For now I shall let her go So she cannot escape your spell

In [16]:
output = pipe('List the person names occurring in the following texts. Text: Ladies and gentlemen, this is Mambo Number Five One, two, three, four, five Everybody in the car, so come on, let us ride To the liquor store around the corner The boys say they want some gin and juice But I really do not wanna Beer-bust like I had last week I must stay deep because talk is cheap I like Angela, Pamela, Sandra and Rita And as I continue, you know they getting sweeter (uh) So what can I do? I really beg you, my Lord To me is flirting is just like a sport Anything fly, it is all good, let me dump it Please set in the trumpet A little bit of Monica in my life A little bit of Erica by my side A little bit of Rita is all I need A little bit of Tina is what I see A little bit of Sandra in the sun A little bit of Mary all night long A little bit of Jessica, here I am A little bit of you makes me your man (ah). Answer: Angela', max_new_tokens=25)

print(output)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'List the person names occurring in the following texts. Text: Ladies and gentlemen, this is Mambo Number Five One, two, three, four, five Everybody in the car, so come on, let us ride To the liquor store around the corner The boys say they want some gin and juice But I really do not wanna Beer-bust like I had last week I must stay deep because talk is cheap I like Angela, Pamela, Sandra and Rita And as I continue, you know they getting sweeter (uh) So what can I do? I really beg you, my Lord To me is flirting is just like a sport Anything fly, it is all good, let me dump it Please set in the trumpet A little bit of Monica in my life A little bit of Erica by my side A little bit of Rita is all I need A little bit of Tina is what I see A little bit of Sandra in the sun A little bit of Mary all night long A little bit of Jessica, here I am A little bit of you makes me your man (ah). Answer: Angela, Pamela, Sandra and Rita are all women. A little bit of Monica by my si

In [17]:
output = pipe('List the person names occurring in the following texts. Text: Ladies and gentlemen, this is Mambo Number Five One, two, three, four, five Everybody in the car, so come on, let us ride To the liquor store around the corner The boys say they want some gin and juice But I really do not wanna Beer-bust like I had last week I must stay deep because talk is cheap I like Angela, Pamela, Sandra and Rita And as I continue, you know they getting sweeter (uh) So what can I do? I really beg you, my Lord To me is flirting is just like a sport Anything fly, it is all good, let me dump it Please set in the trumpet A little bit of Monica in my life A little bit of Erica by my side A little bit of Rita is all I need A little bit of Tina is what I see A little bit of Sandra in the sun A little bit of Mary all night long A little bit of Jessica, here I am A little bit of you makes me your man (ah). Answer: Angela, Pamela', max_new_tokens=25)

print(output)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'List the person names occurring in the following texts. Text: Ladies and gentlemen, this is Mambo Number Five One, two, three, four, five Everybody in the car, so come on, let us ride To the liquor store around the corner The boys say they want some gin and juice But I really do not wanna Beer-bust like I had last week I must stay deep because talk is cheap I like Angela, Pamela, Sandra and Rita And as I continue, you know they getting sweeter (uh) So what can I do? I really beg you, my Lord To me is flirting is just like a sport Anything fly, it is all good, let me dump it Please set in the trumpet A little bit of Monica in my life A little bit of Erica by my side A little bit of Rita is all I need A little bit of Tina is what I see A little bit of Sandra in the sun A little bit of Mary all night long A little bit of Jessica, here I am A little bit of you makes me your man (ah). Answer: Angela, Pamela, Sandra and Rita\n\nAnswer: Mambo Number Nine Six, Two, Three, 

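In the zero-shot case, the output is not at all what was wanted. In the one-shot and two-shot cases, where the answer was started with one or two names, the model listed the names that appear consecutively in the text (Angela, Pamela, Sandra and Rita), but none of the names that come after that.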
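To make "which names were found" concrete, the answer can be compared against a list of expected names. This is only a sketch: `EXPECTED_NAMES` is my own reading of the lyrics in the prompt, and `names_found` is a hypothetical helper, neither of which appears in the original notebook.

```python
import re

# Assumption: the person names in the prompt text, collected by hand.
EXPECTED_NAMES = [
    'Angela', 'Pamela', 'Sandra', 'Rita', 'Monica',
    'Erica', 'Tina', 'Mary', 'Jessica',
]


def names_found(answer, expected=EXPECTED_NAMES):
    """Return the expected names that occur as whole words in `answer`."""
    words = set(re.findall(r'[A-Za-z]+', answer))
    return [name for name in expected if name in words]


# First answer line of the two-shot output shown above.
answer = 'Angela, Pamela, Sandra and Rita'
found = names_found(answer)
missing = [name for name in EXPECTED_NAMES if name not in found]
print(f'found {len(found)} of {len(EXPECTED_NAMES)}; missing: {missing}')
```

On that answer the check finds four of the nine names and reports Monica, Erica, Tina, Mary and Jessica as missing.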

# Two digit addition

In [21]:
output = pipe('This is first grade math exam. 12 + 12 =', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


This is first grade math exam. 12 + 12 = 14. Math-test is really close to math-test, Communists should have been forced to change their exam.




In [22]:
output = pipe('This is first grade math exam. 12 + 12 = 24, 13 + 12 =', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


This is first grade math exam. 12 + 12 = 24, 13 + 12 = 28, 4 + 12 = 9, 13 + 4 = 17, 1 + 12 = 3, 15 + 4 = 12


In [23]:
output = pipe('This is first grade math exam. 12 + 12 = 24, 13 + 12 = 25, 11 + 10 =', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


This is first grade math exam. 12 + 12 = 24, 13 + 12 = 25, 11 + 10 = 26, 10 + 8 = 27, 9 + 8 = 28, 10 + 8 = 30.


Example:




That didn't go well at all! Let's try it without the introductory text.

In [25]:
output = pipe('12 + 12 = 24, 13 + 12 =', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


12 + 12 = 24, 13 + 12 = 27, 14 + 12 = 30, 15 + 12 = 33, etc..... Once you understand the meaning of the variable


The correctness of the answers is inconsistent.
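One way to see how often the model adds correctly is to parse every `a + b = c` statement out of the generated text and recompute the sum. The sketch below does this with a regular expression; `check_sums` is a hypothetical helper that is not part of the notebook, and the sample string is copied from the one-shot output above.

```python
import re

# Matches statements such as "13 + 12 = 28".
SUM_PATTERN = re.compile(r'(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)')


def check_sums(text):
    """Return (statement, is_correct) for every addition found in `text`."""
    results = []
    for a, b, c in SUM_PATTERN.findall(text):
        results.append((f'{a} + {b} = {c}', int(a) + int(b) == int(c)))
    return results


generated = (
    'This is first grade math exam. 12 + 12 = 24, 13 + 12 = 28, '
    '4 + 12 = 9, 13 + 4 = 17, 1 + 12 = 3, 15 + 4 = 12'
)
results = check_sums(generated)
for statement, ok in results:
    print(statement, 'correct' if ok else 'wrong')
```

Note that the first statement comes from the prompt itself, so only the later ones measure the model. In this output only `13 + 4 = 17` is a correct sum produced by the model.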

# Own case

In [26]:
output = pipe('Which country won the Eurovision Song Contest in the following years? 2006:', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Which country won the Eurovision Song Contest in the following years? 2006: Russia 2011: Portugal


In [28]:
output = pipe('Which country won the Eurovision Song Contest in the following years? 2005: Greece, 2006:', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Which country won the Eurovision Song Contest in the following years? 2005: Greece, 2006: Iceland, 2007: Spain, 2008: Portugal, 2009: Italy, 2010: Greece, 2011: Russia and 2012: Greece


In [29]:
output = pipe('Which country won the Eurovision Song Contest in the following years? 2004: Ukraine, 2005: Greece, 2006:', max_new_tokens=25)

print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Which country won the Eurovision Song Contest in the following years? 2004: Ukraine, 2005: Greece, 2006: Finland, 2007: Canada, 2008: Russia, 2009: Spain, 2010: Portugal.

Ukrainian




It manages to get the formatting right, though the actual countries are wrong in most cases. It still managed to get Finland for 2006 right in the third (two-shot) example.
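The `year: country` pairs in the output can be checked against the real winners in the same spirit. In the sketch below, the `WINNERS` table is filled in from general knowledge rather than from the notebook, and `score_winners` is a hypothetical helper; the pattern only captures single-word country names, which is enough for the outputs shown here.

```python
import re

# Assumption: actual contest winners, entered by hand (not from the notebook).
WINNERS = {
    2004: 'Ukraine', 2005: 'Greece', 2006: 'Finland',
    2007: 'Serbia', 2008: 'Russia', 2009: 'Norway',
    2010: 'Germany', 2011: 'Azerbaijan', 2012: 'Sweden',
}

# Matches pairs such as "2006: Finland" (single-word country names only).
PAIR_PATTERN = re.compile(r'(\d{4}):\s*([A-Z][A-Za-z]+)')


def score_winners(text, winners=WINNERS):
    """Return (year, country, is_correct) for every pair found in `text`."""
    return [
        (int(year), country, winners.get(int(year)) == country)
        for year, country in PAIR_PATTERN.findall(text)
    ]


# Output of the two-shot prompt shown above.
generated = (
    'Which country won the Eurovision Song Contest in the following years? '
    '2004: Ukraine, 2005: Greece, 2006: Finland, 2007: Canada, '
    '2008: Russia, 2009: Spain, 2010: Portugal.'
)
scores = score_winners(generated)
correct_years = [year for year, _, ok in scores if ok]
print(correct_years)
```

The 2004 and 2005 entries were given in the prompt, so only the years from 2006 onward count as model answers.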