# Language modelling

The exercise shows how a language model may be used to solve word-prediction tasks and to generate text.

## Tasks

Objectives (8 points):

**Imports**

In [1]:
from transformers import pipeline
import pandas as pd

1. Read the documentation of [Language modelling in the Transformers](https://huggingface.co/transformers/task_summary.html#language-modeling) library.

In [2]:
generator = pipeline(task="text-generation")
prompt = "Hello, It's me and I am testing this generator"
generator(prompt)

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.




To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development




All PyTorch model weights were used when initializing TFGPT2LMHeadModel.

All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, It's me and I am testing this generator on my computer. For about a day. I need a big hard drive with plenty of space for everything. I just need it to generate some code. To do this I can run this:"}]

In [4]:
text = "Hugging Face is a community-based open-source <mask> for machine learning."
fill_mask = pipeline(task="fill-mask")
preds = fill_mask(text, top_k=1)
preds = [
    {
        "score": round(pred["score"], 4),
        "token": pred["token"],
        "token_str": pred["token_str"],
        "sequence": pred["sequence"],
    }
    for pred in preds
]

preds

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.


All PyTorch model weights were used when initializing TFRobertaForMaskedLM.

All the weights of TFRobertaForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForMaskedLM for predictions without further training.


[{'score': 0.2236,
  'token': 1761,
  'token_str': ' platform',
  'sequence': 'Hugging Face is a community-based open-source platform for machine learning.'}]

2. Download three [Polish models](https://huggingface.co/models?filter=pl) from the Hugging Face repository. These should be regular language models that were not fine-tuned. E.g. `HerBERT` and `papuGaPT2` are good examples. You can also try using Bielik for that, but make sure you are using the model via the Transformers API, not a GUI.

In [5]:
bert = pipeline('fill-mask', model='bert-base-multilingual-cased')
distilbert = pipeline('fill-mask', model='distilbert-base-multilingual-cased')
roberta = pipeline("fill-mask", model='xlm-roberta-base')

All PyTorch model weights were used when initializing TFBertForMaskedLM.

All the weights of TFBertForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.


All PyTorch model weights were used when initializing TFDistilBertForMaskedLM.

All the weights of TFDistilBertForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForMaskedLM for predictions without further training.


All PyTorch model weights were used when initializing TFXLMRobertaForMaskedLM.

All the weights of TFXLMRobertaForMaskedLM were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFXLMRobertaForMaskedLM for predictions without further training.



In [6]:
models = [
    'bert-base-multilingual-cased',
    'distilbert-base-multilingual-cased',
    'xlm-roberta-base'
]

**BERT Base Multilingual Cased** is a transformer model pretrained on Wikipedia text in 104 languages, using the masked language modeling and next sentence prediction objectives. It is typically fine-tuned for tasks like text classification and question answering. Because it is case-sensitive, it distinguishes between capitalized and lowercase words.

**DistilBERT Base Multilingual Cased** is a lighter and faster distilled version of multilingual BERT. The original DistilBERT paper reports retaining 97% of BERT's language understanding capabilities while being 60% faster; those figures were measured on the English model, so they are only indicative here. It is also pretrained on a multilingual corpus and preserves case distinctions.

**XLM-RoBERTa Base** is built on the RoBERTa architecture and pretrained with masked language modeling on CommonCrawl text in 100 languages. The much larger training corpus is meant to improve performance, especially for low-resource languages. The model is designed for cross-lingual transfer; as an encoder-only model it does not translate text by itself.

3. Devise a method to test if the language model understands Polish cases. E.g. testing for the *nominative case* could be expressed as "Warszawa to największe `[MASK]`", and the masked word should be in the nominative case. Create sentences for each case.

In [9]:
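def fill_mask_for_model(text, model_name: str, k=5):
    """Return the top-k fill-mask predictions of one model as a list of dicts."""
    # Note: the pipeline is rebuilt on every call, so the model is reloaded each time.
    model = pipeline('fill-mask', model=model_name)

    # Prompts are written with BERT's '[MASK]'; substitute the model's own mask
    # token (e.g. '<mask>' for xlm-roberta-base). For BERT-style models this is a no-op.
    text = text.replace('[MASK]', model.tokenizer.mask_token)

    preds = model(text, top_k=k)

    results = []
    for pred in preds:
        results.append({
            "model": model_name,
            "score": round(pred["score"], 2),
            "token": pred["token_str"],
            "sequence": pred["sequence"],
        })

    return results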

In [10]:
def fill_mask_for_all_models_and_return_pandas(text, k = 5):
    result_list = []

    for model in models:
        result_list += fill_mask_for_model(text, model, k)
    
    return pd.DataFrame(result_list)
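
Because `fill_mask_for_model` builds a new pipeline on every call, each model is loaded again for every test sentence. A minimal sketch of one way to avoid that, assuming it is acceptable to keep all three models in memory. The names `make_cached_loader` and `fake_factory` are hypothetical; in the notebook the factory would be something like `lambda name: pipeline('fill-mask', model=name)`.

```python
from functools import lru_cache
from typing import Callable


def make_cached_loader(factory: Callable[[str], object]) -> Callable[[str], object]:
    """Wrap a model factory so that each model name is built only once."""

    @lru_cache(maxsize=None)
    def load(model_name: str):
        return factory(model_name)

    return load


# Stand-in factory (hypothetical) that records its calls, so the caching
# behaviour can be shown without downloading any model.
calls = []


def fake_factory(model_name: str):
    calls.append(model_name)
    return f"pipeline-for-{model_name}"


load = make_cached_loader(fake_factory)
load("xlm-roberta-base")
load("xlm-roberta-base")  # served from the cache, the factory is not called again
print(calls)
```

With a loader like this, the helper could call `load(model_name)` instead of `pipeline(...)`, at the cost of holding every loaded model in memory.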

In [11]:
df_bert_base = pd.DataFrame(fill_mask_for_model('Warszawa to największe [MASK]', models[0]))


In [12]:
df_bert_base

| | model | score | token | sequence |
|---|---|---|---|---|
| 0 | bert-base-multilingual-cased | 0.62 | `.` | Warszawa to największe. |
| 1 | bert-base-multilingual-cased | 0.16 | `miasto` | Warszawa to największe miasto |
| 2 | bert-base-multilingual-cased | 0.03 | `Miasto` | Warszawa to największe Miasto |
| 3 | bert-base-multilingual-cased | 0.02 | `miasta` | Warszawa to największe miasta |
| 4 | bert-base-multilingual-cased | 0.01 | `:` | Warszawa to największe : |


In [13]:
df_distilbert = pd.DataFrame(fill_mask_for_model('Warszawa to największe [MASK]', models[1]))


In [14]:
df_distilbert

| | model | score | token | sequence |
|---|---|---|---|---|
| 0 | distilbert-base-multilingual-cased | 0.56 | `miasto` | Warszawa to największe miasto |
| 1 | distilbert-base-multilingual-cased | 0.1 | `miasta` | Warszawa to największe miasta |
| 2 | distilbert-base-multilingual-cased | 0.02 | `Miasto` | Warszawa to największe Miasto |
| 3 | distilbert-base-multilingual-cased | 0.01 | `.` | Warszawa to największe. |
| 4 | distilbert-base-multilingual-cased | 0.01 | `##sza` | Warszawa to największesza |


In [15]:
df_roberta = pd.DataFrame(fill_mask_for_model('Warszawa to największe [MASK]', models[2]))


In [16]:
df_roberta

| | model | score | token | sequence |
|---|---|---|---|---|
| 0 | xlm-roberta-base | 0.3 | `...` | Warszawa to największe... |
| 1 | xlm-roberta-base | 0.16 | `miasto` | Warszawa to największe miasto |
| 2 | xlm-roberta-base | 0.07 | `!` | Warszawa to największe! |
| 3 | xlm-roberta-base | 0.05 | `...` | Warszawa to największe ... |
| 4 | xlm-roberta-base | 0.05 | `city` | Warszawa to największe city |
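
One way to turn such a table into a number is to sum the scores of the predictions that have the expected grammatical form. The sketch below is only an illustration: `expected_mass` is a hypothetical helper, the set of accepted forms has to be written by hand for every test sentence, and matching ignores capitalisation. The input uses the `token` and `score` keys returned by `fill_mask_for_model`, with values copied from the `df_bert_base` output above.

```python
def expected_mass(preds, expected_forms):
    """Sum the scores of predictions whose token is one of the expected forms."""
    expected = {form.lower() for form in expected_forms}
    return sum(p["score"] for p in preds if p["token"].strip().lower() in expected)


# Top-5 predictions of bert-base-multilingual-cased for
# 'Warszawa to największe [MASK]', copied from the df_bert_base output.
bert_preds = [
    {"token": ".", "score": 0.62},
    {"token": "miasto", "score": 0.16},
    {"token": "Miasto", "score": 0.03},
    {"token": "miasta", "score": 0.02},
    {"token": ":", "score": 0.01},
]

# 'miasto' is the nominative singular form expected after 'największe'.
mass = expected_mass(bert_preds, {"miasto"})
print(round(mass, 2))  # 0.19
```

Repeating this for one sentence per case, with a hand-written set of accepted forms for each, would give a per-case score that can be compared across models.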


4. Devise a method to test long-range relationships such as gender. E.g. you can use two verbs with masculine and feminine gender, where one of the verbs is masked. Both verbs should have the same gender, assuming the subject is the same. Define at least 3 such sentences.

In [17]:
fill_mask_for_all_models_and_return_pandas('Ja tak ciężko pracowałam, a on [MASK] to zrobić.')


| | model | score | token | sequence |
|---|---|---|---|---|
| 0 | bert-base-multilingual-cased | 0.35 | `sam` | Ja tak ciężko pracowałam, a on sam to zrobić. |
| 1 | bert-base-multilingual-cased | 0.07 | `może` | Ja tak ciężko pracowałam, a on może to zrobić. |
| 2 | bert-base-multilingual-cased | 0.04 | `to` | Ja tak ciężko pracowałam, a on to to zrobić. |
| 3 | bert-base-multilingual-cased | 0.04 | `tak` | Ja tak ciężko pracowałam, a on tak to zrobić. |
| 4 | bert-base-multilingual-cased | 0.04 | `by` | Ja tak ciężko pracowałam, a on by to zrobić. |
| 5 | distilbert-base-multilingual-cased | 0.13 | `miał` | Ja tak ciężko pracowałam, a on miał to zrobić. |
| 6 | distilbert-base-multilingual-cased | 0.05 | `ten` | Ja tak ciężko pracowałam, a on ten to zrobić. |
| 7 | distilbert-base-multilingual-cased | 0.04 | `魂` | Ja tak ciężko pracowałam, a on 魂 to zrobić. |
| 8 | distilbert-base-multilingual-cased | 0.03 | `sam` | Ja tak ciężko pracowałam, a on sam to zrobić. |
| 9 | distilbert-base-multilingual-cased | 0.03 | `tylko` | Ja tak ciężko pracowałam, a on tylko to zrobić. |


In [18]:
fill_mask_for_all_models_and_return_pandas('On nie potrafi tego używać i miał nadzieję że ja jemu [MASK].')


| | model | score | token | sequence |
|---|---|---|---|---|
| 0 | bert-base-multilingual-cased | 0.31 | `##dzi` | On nie potrafi tego używać i miał nadzieję że ... |
| 1 | bert-base-multilingual-cased | 0.12 | `##dzie` | On nie potrafi tego używać i miał nadzieję że ... |
| 2 | bert-base-multilingual-cased | 0.05 | `##ją` | On nie potrafi tego używać i miał nadzieję że ... |
| 3 | bert-base-multilingual-cased | 0.03 | `##ł` | On nie potrafi tego używać i miał nadzieję że ... |
| 4 | bert-base-multilingual-cased | 0.03 | `##dz` | On nie potrafi tego używać i miał nadzieję że ... |
| 5 | distilbert-base-multilingual-cased | 0.07 | `było` | On nie potrafi tego używać i miał nadzieję że ... |
| 6 | distilbert-base-multilingual-cased | 0.04 | `udało` | On nie potrafi tego używać i miał nadzieję że ... |
| 7 | distilbert-base-multilingual-cased | 0.03 | `miał` | On nie potrafi tego używać i miał nadzieję że ... |
| 8 | distilbert-base-multilingual-cased | 0.03 | `nie` | On nie potrafi tego używać i miał nadzieję że ... |
| 9 | distilbert-base-multilingual-cased | 0.02 | `prowadził` | On nie potrafi tego używać i miał nadzieję że ... |


In [19]:
fill_mask_for_all_models_and_return_pandas('Chciałam iść na zakupy, ale on się na to [MASK].')


| | model | score | token | sequence |
|---|---|---|---|---|
| 0 | bert-base-multilingual-cased | 0.05 | `nie` | Chciałam iść na zakupy, ale on się na to nie. |
| 1 | bert-base-multilingual-cased | 0.03 | `##czy` | Chciałam iść na zakupy, ale on się na toczy. |
| 2 | bert-base-multilingual-cased | 0.02 | `##r` | Chciałam iść na zakupy, ale on się na tor. |
| 3 | bert-base-multilingual-cased | 0.02 | `##rze` | Chciałam iść na zakupy, ale on się na torze. |
| 4 | bert-base-multilingual-cased | 0.01 | `sam` | Chciałam iść na zakupy, ale on się na to sam. |
| 5 | distilbert-base-multilingual-cased | 0.13 | `stanowisko` | Chciałam iść na zakupy, ale on się na to stano... |
| 6 | distilbert-base-multilingual-cased | 0.06 | `##rze` | Chciałam iść na zakupy, ale on się na torze. |
| 7 | distilbert-base-multilingual-cased | 0.02 | `czas` | Chciałam iść na zakupy, ale on się na to czas. |
| 8 | distilbert-base-multilingual-cased | 0.02 | `tylko` | Chciałam iść na zakupy, ale on się na to tylko. |
| 9 | distilbert-base-multilingual-cased | 0.02 | `nie` | Chciałam iść na zakupy, ale on się na to nie. |
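
To check gender agreement automatically rather than by reading the tables, the predicted tokens can be tagged with a rough gender label. The sketch below is a suffix heuristic only, not a morphological analysis: `past_tense_gender` is a hypothetical helper that looks at typical singular past-tense endings, and it will mislabel non-verbs that happen to share those endings.

```python
def past_tense_gender(token):
    """Guess the gender of a Polish singular past-tense verb from its ending.

    Returns 'm', 'f', 'n', or None when no ending matches. Word pieces
    (tokens starting with '##') are not full words and are skipped.
    """
    word = token.strip().lower()
    if word.startswith("##"):
        return None
    endings = [
        ("łam", "f"), ("łaś", "f"), ("ła", "f"),   # feminine: -łam, -łaś, -ła
        ("łem", "m"), ("łeś", "m"), ("ł", "m"),    # masculine: -łem, -łeś, -ł
        ("ło", "n"),                               # neuter: -ło
    ]
    for suffix, gender in endings:
        if word.endswith(suffix):
            return gender
    return None


# distilbert-base-multilingual-cased tokens for
# 'Ja tak ciężko pracowałam, a on [MASK] to zrobić.', copied from the first table of this task.
tokens = ["miał", "ten", "魂", "sam", "tylko"]
genders = [past_tense_gender(t) for t in tokens]
print(genders)
```

In that sentence the masked verb belongs to the subject "on", so a prediction counts as agreeing when its label is `'m'`; only `miał` qualifies here.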


5. Check if the model captures real-world knowledge. For instance, the sentence "`[MASK]` wrze w temperaturze 100 stopni, a zamarza w temperaturze 0 stopni Celsjusza." checks if the model "knows" the description of water. Define at least 3 such sentences.

In [20]:
fill_mask_for_all_models_and_return_pandas('[MASK] wrze w temperaturze 100 stopni, a zamarza w temperaturze 0 stopni Celsjusza.')


| | model | score | token | sequence |
|---|---|---|---|---|
| 0 | bert-base-multilingual-cased | 0.04 | `Jego` | Jego wrze w temperaturze 100 stopni, a zamarza... |
| 1 | bert-base-multilingual-cased | 0.03 | `Za` | Za wrze w temperaturze 100 stopni, a zamarza w... |
| 2 | bert-base-multilingual-cased | 0.03 | `Po` | Po wrze w temperaturze 100 stopni, a zamarza w... |
| 3 | bert-base-multilingual-cased | 0.02 | `Na` | Na wrze w temperaturze 100 stopni, a zamarza w... |
| 4 | bert-base-multilingual-cased | 0.02 | `W` | W wrze w temperaturze 100 stopni, a zamarza w ... |
| 5 | distilbert-base-multilingual-cased | 0.09 | `Na` | Na wrze w temperaturze 100 stopni, a zamarza w... |
| 6 | distilbert-base-multilingual-cased | 0.05 | `We` | We wrze w temperaturze 100 stopni, a zamarza w... |
| 7 | distilbert-base-multilingual-cased | 0.03 | `Od` | Od wrze w temperaturze 100 stopni, a zamarza w... |
| 8 | distilbert-base-multilingual-cased | 0.03 | `W` | W wrze w temperaturze 100 stopni, a zamarza w ... |
| 9 | distilbert-base-multilingual-cased | 0.03 | `we` | we wrze w temperaturze 100 stopni, a zamarza w... |


In [21]:
fill_mask_for_all_models_and_return_pandas('[MASK] jest największą górą na świecie.')


| | model | score | token | sequence |
|---|---|---|---|---|
| 0 | bert-base-multilingual-cased | 0.07 | `.` | . jest największą górą na świecie. |
| 1 | bert-base-multilingual-cased | 0.06 | `to` | to jest największą górą na świecie. |
| 2 | bert-base-multilingual-cased | 0.05 | `To` | To jest największą górą na świecie. |
| 3 | bert-base-multilingual-cased | 0.04 | `Tonga` | Tonga jest największą górą na świecie. |
| 4 | bert-base-multilingual-cased | 0.04 | `Ta` | Ta jest największą górą na świecie. |
| 5 | distilbert-base-multilingual-cased | 0.06 | `Miasto` | Miasto jest największą górą na świecie. |
| 6 | distilbert-base-multilingual-cased | 0.04 | `Obecnie` | Obecnie jest największą górą na świecie. |
| 7 | distilbert-base-multilingual-cased | 0.03 | `popolasiù` | popolasiù jest największą górą na świecie. |
| 8 | distilbert-base-multilingual-cased | 0.03 | `Grupa` | Grupa jest największą górą na świecie. |
| 9 | distilbert-base-multilingual-cased | 0.02 | `Jest` | Jest jest największą górą na świecie. |


In [22]:
fill_mask_for_all_models_and_return_pandas('Samochody jeżdżą na [MASK]')


| | model | score | token | sequence |
|---|---|---|---|---|
| 0 | bert-base-multilingual-cased | 0.11 | `:` | Samochody jeżdżą na : |
| 1 | bert-base-multilingual-cased | 0.03 | `linii` | Samochody jeżdżą na linii |
| 2 | bert-base-multilingual-cased | 0.03 | `stacji` | Samochody jeżdżą na stacji |
| 3 | bert-base-multilingual-cased | 0.01 | `ulicy` | Samochody jeżdżą na ulicy |
| 4 | bert-base-multilingual-cased | 0.01 | `polu` | Samochody jeżdżą na polu |
| 5 | distilbert-base-multilingual-cased | 0.05 | `ziemi` | Samochody jeżdżą na ziemi |
| 6 | distilbert-base-multilingual-cased | 0.05 | `:` | Samochody jeżdżą na : |
| 7 | distilbert-base-multilingual-cased | 0.04 | `ёсць` | Samochody jeżdżą na ёсць |
| 8 | distilbert-base-multilingual-cased | 0.03 | `Ziemi` | Samochody jeżdżą na Ziemi |
| 9 | distilbert-base-multilingual-cased | 0.02 | `terenie` | Samochody jeżdżą na terenie |
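
For the knowledge probes, a simple summary is the rank at which an accepted answer first appears among the top-k predictions. A minimal sketch, assuming the accepted answers are listed by hand; `rank_of_answer` is a hypothetical helper and the comparison ignores capitalisation.

```python
def rank_of_answer(preds, accepted):
    """Return the 1-based rank of the first accepted answer, or None if absent."""
    accepted_lower = {answer.lower() for answer in accepted}
    for rank, pred in enumerate(preds, start=1):
        if pred["token"].strip().lower() in accepted_lower:
            return rank
    return None


# bert-base-multilingual-cased tokens for the boiling-point sentence,
# copied from the first table of this task.
water_preds = [{"token": t} for t in ["Jego", "Za", "Po", "Na", "W"]]
print(rank_of_answer(water_preds, {"woda"}))  # None: 'woda' is not in the top 5
```

A rank of `None` for every model on a probe, as in this case, means that none of them retrieved the fact within the inspected top-k.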


6. Check zero-shot learning capabilities of the models. Provide at least 5 sentences with different sentiment for the following scheme: "'Ten film to był kiler. Nie mogłem się oderwać od ekranu.' Wypowiedź ta jest zdecydowanie `[MASK]`" Try different prompts to see if they make any difference.

In [23]:
fill_mask_for_all_models_and_return_pandas('"Ten film to był kiler. Nie mogłem się oderwać od ekranu." Wypowiedź ta jest zdecydowanie [MASK].')


| | model | score | token | sequence |
|---|---|---|---|---|
| 0 | bert-base-multilingual-cased | 0.1 | `słowa` | " Ten film to był kiler. Nie mogłem się oderwa... |
| 1 | bert-base-multilingual-cased | 0.05 | `znana` | " Ten film to był kiler. Nie mogłem się oderwa... |
| 2 | bert-base-multilingual-cased | 0.03 | `ta` | " Ten film to był kiler. Nie mogłem się oderwa... |
| 3 | bert-base-multilingual-cased | 0.02 | `sama` | " Ten film to był kiler. Nie mogłem się oderwa... |
| 4 | bert-base-multilingual-cased | 0.02 | `to` | " Ten film to był kiler. Nie mogłem się oderwa... |
| 5 | distilbert-base-multilingual-cased | 0.24 | `filmu` | " Ten film to był kiler. Nie mogłem się oderwa... |
| 6 | distilbert-base-multilingual-cased | 0.09 | `znana` | " Ten film to był kiler. Nie mogłem się oderwa... |
| 7 | distilbert-base-multilingual-cased | 0.04 | `serialu` | " Ten film to był kiler. Nie mogłem się oderwa... |
| 8 | distilbert-base-multilingual-cased | 0.02 | `Tuttavia` | " Ten film to był kiler. Nie mogłem się oderwa... |
| 9 | distilbert-base-multilingual-cased | 0.01 | `Dakota` | " Ten film to był kiler. Nie mogłem się oderwa... |


In [24]:
fill_mask_for_all_models_and_return_pandas('"Nienawidzę polityki." Zdanie to ma wydźwięk [MASK].')


| | model | score | token | sequence |
|---|---|---|---|---|
| 0 | bert-base-multilingual-cased | 0.09 | `"` | " Nienawidzę polityki. " Zdanie to ma wydźwięk ". |
| 1 | bert-base-multilingual-cased | 0.02 | `w` | " Nienawidzę polityki. " Zdanie to ma wydźwięk w. |
| 2 | bert-base-multilingual-cased | 0.02 | `pt` | " Nienawidzę polityki. " Zdanie to ma wydźwięk... |
| 3 | bert-base-multilingual-cased | 0.02 | `tzw` | " Nienawidzę polityki. " Zdanie to ma wydźwięk... |
| 4 | bert-base-multilingual-cased | 0.02 | `tekstu` | " Nienawidzę polityki. " Zdanie to ma wydźwięk... |
| 5 | distilbert-base-multilingual-cased | 0.17 | `勅` | " Nienawidzę polityki. " Zdanie to ma wydźwięk 勅. |
| 6 | distilbert-base-multilingual-cased | 0.08 | `pracy` | " Nienawidzę polityki. " Zdanie to ma wydźwięk... |
| 7 | distilbert-base-multilingual-cased | 0.04 | `władzy` | " Nienawidzę polityki. " Zdanie to ma wydźwięk... |
| 8 | distilbert-base-multilingual-cased | 0.04 | `człowieka` | " Nienawidzę polityki. " Zdanie to ma wydźwięk... |
| 9 | distilbert-base-multilingual-cased | 0.02 | `prawa` | " Nienawidzę polityki. " Zdanie to ma wydźwięk... |


In [25]:
fill_mask_for_all_models_and_return_pandas('"Staram się być obiektywyny niezależnie co się dzieje." Wypowiedź ta jest zdecydowanie [MASK].')


| | model | score | token | sequence |
|---|---|---|---|---|
| 0 | bert-base-multilingual-cased | 0.08 | `słowa` | " Staram się być obiektywyny niezależnie co si... |
| 1 | bert-base-multilingual-cased | 0.03 | `znana` | " Staram się być obiektywyny niezależnie co si... |
| 2 | bert-base-multilingual-cased | 0.02 | `tzw` | " Staram się być obiektywyny niezależnie co si... |
| 3 | bert-base-multilingual-cased | 0.02 | `stała` | " Staram się być obiektywyny niezależnie co si... |
| 4 | bert-base-multilingual-cased | 0.02 | `"` | " Staram się być obiektywyny niezależnie co si... |
| 5 | distilbert-base-multilingual-cased | 0.07 | `znana` | " Staram się być obiektywyny niezależnie co si... |
| 6 | distilbert-base-multilingual-cased | 0.02 | `nie` | " Staram się być obiektywyny niezależnie co si... |
| 7 | distilbert-base-multilingual-cased | 0.02 | `dopiero` | " Staram się być obiektywyny niezależnie co si... |
| 8 | distilbert-base-multilingual-cased | 0.02 | `pracy` | " Staram się być obiektywyny niezależnie co si... |
| 9 | distilbert-base-multilingual-cased | 0.02 | `człowieka` | " Staram się być obiektywyny niezależnie co si... |
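
With unrestricted top-k predictions the models rarely produce a sentiment word at all. A common variation is to score only a fixed set of label words, which the fill-mask pipeline supports through its `targets` argument. The sketch below is an assumption-laden illustration: `LABEL_WORDS`, `classify_from_preds` and `zero_shot_sentiment` are hypothetical names, and the example scores are made up to show the data flow, not model output.

```python
# Hypothetical mapping from Polish label words to sentiment classes.
LABEL_WORDS = {"pozytywna": "positive", "negatywna": "negative"}


def classify_from_preds(preds, label_words):
    """Return the class of the best-scoring label word, or None if none was predicted."""
    candidates = [p for p in preds if p["token_str"].strip() in label_words]
    if not candidates:
        return None
    best = max(candidates, key=lambda p: p["score"])
    return label_words[best["token_str"].strip()]


def zero_shot_sentiment(fill_mask, statement, label_words):
    """Score only the label words for one statement using a fill-mask pipeline."""
    mask = fill_mask.tokenizer.mask_token
    prompt = f'"{statement}" Wypowiedź ta jest zdecydowanie {mask}.'
    # If a label word is split into several sub-tokens, the pipeline scores
    # only its first piece, and the lookup in classify_from_preds will not match.
    preds = fill_mask(prompt, targets=list(label_words))
    return classify_from_preds(preds, label_words)


# Made-up predictions in the pipeline's output format, for illustration only.
example_preds = [
    {"token_str": "pozytywna", "score": 0.7},
    {"token_str": "negatywna", "score": 0.3},
]
print(classify_from_preds(example_preds, LABEL_WORDS))  # positive
```

Calling `zero_shot_sentiment(pipeline('fill-mask', model=name), ...)` for each model and statement would then yield one label per cell instead of a list of arbitrary tokens.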


7. Take into account the fact that causal language models such as PapuGaPT2 or plT5 will only generate continuations of the sentences, so the examples have to be created according to that paradigm.
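
A minimal sketch of what that paradigm means in practice: the masked position has to be moved to the end of the prompt, and the first generated word plays the role of the mask prediction. The rewritten prompts and the helper `first_generated_word` are illustrative assumptions; `generator` stands for any text-generation pipeline, and `echo_generator` is a stand-in used only so the example runs without a model.

```python
# Masked sentences from the tasks above, rewritten (by hand, as an illustration)
# so that the word to be predicted comes last.
CAUSAL_PROMPTS = {
    "Warszawa to największe [MASK]": "Warszawa to największe",
    "Samochody jeżdżą na [MASK]": "Samochody jeżdżą na",
    "[MASK] wrze w temperaturze 100 stopni, a zamarza w temperaturze 0 stopni Celsjusza.":
        "Ciecz, która wrze w temperaturze 100 stopni, a zamarza w temperaturze 0 stopni Celsjusza, to",
}


def first_generated_word(generator, prompt):
    """Return the first word a text-generation pipeline appends to the prompt."""
    outputs = generator(prompt, max_new_tokens=5, num_return_sequences=1)
    # By default the pipeline returns the prompt followed by the continuation.
    continuation = outputs[0]["generated_text"][len(prompt):]
    words = continuation.split()
    return words[0].strip(".,!?") if words else ""


def echo_generator(prompt, **kwargs):
    """Stand-in for a real pipeline; always appends the same fixed text."""
    return [{"generated_text": prompt + " miasto w Polsce."}]


print(first_generated_word(echo_generator, "Warszawa to największe"))  # miasto
```

With a real model, `generator` would be created with `pipeline('text-generation', model=...)` for a Polish causal model, and sampling settings would affect which word comes first.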

8. Answer the following questions (2 points):

**Which of the models produced the best results?**
The xlm-roberta-base model produced the most consistent results across tasks. It recognized Polish grammar relatively well, captured some long-distance dependencies, and demonstrated a limited amount of general knowledge.

**Was any of the models able to capture Polish grammar?**
All of the models showed some ability to recognize Polish grammar, especially basic grammatical forms. However, they often struggled with more complex structures, particularly in long sentences or where gender and form had to stay consistent.

**Was any of the models able to capture long-distance relationships between the words?**
Each model occasionally managed to capture long-distance relationships, but xlm-roberta-base was the best in this area. It was more likely to maintain grammatical consistency over longer distances, though it still made mistakes with gender and agreement when dealing with more complex sentences.

**Was any of the models able to capture world knowledge?**
The xlm-roberta-base model showed some understanding of general world knowledge, like identifying Mount Everest as the tallest mountain. However, all models had trouble with common knowledge, such as identifying water as the substance that boils at 100°C or gasoline as the fuel for cars. This suggests that the models' real-world knowledge is limited.

**Was any of the models good at doing zero-shot classification?**
The xlm-roberta-base model performed moderately well in zero-shot classification but struggled with sentiment tasks.

**What are the most striking errors made by the models?**
The models made several notable errors, such as predicting irrelevant tokens like punctuation or stray characters (e.g., 勅) and producing word-piece fragments (e.g., "##dzi") that did not fit the context. They often struggled with grammatical consistency, failing to maintain gender or case agreement in longer sentences. Additionally, they showed a lack of basic world knowledge, missing obvious facts like “water” boiling at 100°C or “gasoline” being a fuel for cars.

## Hints

1. Language modelling (LM) is the task of computing the probability distribution of the next word given a sequence of
   preceding words.
1. In the past, the most popular LMs were based on n-gram counting. The probability of the next word was
   approximated by the relative frequency with which it followed the preceding n-1 words in a corpus. Usually n=3,
   since larger values made the counts very sparse and the models extremely large.
1. Many algorithms were devised to improve the probability estimates for infrequent words. Among them, Kneser-Ney
   smoothing was the most popular.
1. SRILM is the most popular toolkit for creating traditional language models.
1. At present, recurrent neural networks, attention networks and transformers are the most popular neural-network
   architectures for creating LMs.
1. GPT-3, one of the largest LMs at the time of its publication, is described in (mind the number of authors!)
   *Language Models are Few-Shot Learners* by
   Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav
   Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon
   Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler,
   Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya
   Sutskever, Dario Amodei.
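
The n-gram estimate mentioned in the hints can be sketched in a few lines. This is an unsmoothed maximum-likelihood trigram model on a toy corpus, written only to illustrate the counting idea; the function names and the corpus are made up for the example, and a practical model would add smoothing such as Kneser-Ney.

```python
from collections import Counter


def train_trigram_counts(tokens):
    """Count trigrams and the two-word histories that precede a third word."""
    triples = list(zip(tokens, tokens[1:], tokens[2:]))
    trigrams = Counter(triples)
    histories = Counter((w1, w2) for w1, w2, _ in triples)
    return trigrams, histories


def next_word_probability(trigrams, histories, w1, w2, w3):
    """Relative frequency of w3 after the history (w1, w2); 0.0 for unseen histories."""
    history_count = histories[(w1, w2)]
    if history_count == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / history_count


corpus = "ala ma kota a ola ma psa a ala ma psa".split()
trigrams, histories = train_trigram_counts(corpus)

print(next_word_probability(trigrams, histories, "ala", "ma", "kota"))  # 0.5
print(next_word_probability(trigrams, histories, "ola", "ma", "kota"))  # 0.0
```

The second result shows the sparsity problem: "ola ma kota" never occurs in the corpus, so the unsmoothed estimate is zero even though the sentence is perfectly plausible.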