# Neural Natural Language Generation (NNLG)

Welcome to the First Lesson of the Neural NLG Tutorial!

In this session, you will learn:

- How to find, load, and preprocess a dataset from the Hugging Face Dataset Repository and Datasets library.

- How to find, load, and apply a model from the Hugging Face Models repository and Transformers library on a dataset.


## 1. Finding, Loading, and Processing Datasets

### The Hugging Face Dataset Repository

We will start by going to the [`Hugging Face Homepage`](https://huggingface.co).

Hugging Face offers a lot of useful and interesting resources, but for now we will focus on its extensive Dataset repository. To do so, we can click on the [`Datasets`](https://huggingface.co/datasets) button on the navigation bar, which should take us to the repository.

On the left side of the screen you will find a series of categories (Main, Tasks, Libraries, Languages, Licences, Other). By clicking on these categories, you will be presented with different options to facilitate your search. At the top of the right side you will find a search bar as well as a dropdown to choose how to sort your results.

For this lesson we will work on a classic Text Generation task: Machine Translation. Let's look for the [`Helsinki-NLP/europarl`](https://huggingface.co/datasets/Helsinki-NLP/europarl) dataset.

On this view we have some information about the dataset: the subsets, the splits, the columns, and even the visualization of some examples.

As you can see, this is a rather large dataset with many subsets (one per language pair).

Each data instance has a single column (`translation`), which contains a dictionary with parallel text in two different languages.

### The `datasets` Python library

Together with their Dataset repository, Hugging Face also provides the [`datasets`](https://huggingface.co/docs/datasets/en/index) library. You can use this library to load their datasets but also to create your own custom datasets.

First, we need to install the library in case we have not installed it already:

In [2]:
! pip install datasets --quiet


Once the library has been installed we can import the `load_dataset` function that allows us to download the dataset and manipulate it in Python.

In [3]:
from datasets import load_dataset

After importing the function, we can get our dataset. If you want a quick overview of the `load_dataset` function, you can check [this link](https://huggingface.co/docs/datasets/v3.0.1/en/about_dataset_load#eli5-loaddataset), or have a more in-depth look at [this other link](https://huggingface.co/docs/datasets/v3.0.1/en/package_reference/loading_methods#datasets.load_dataset).

For now, let's start by saying that the first argument of the function is the identifier of the dataset you want to load. In our case, that would be [`Helsinki-NLP/europarl`](https://huggingface.co/datasets/Helsinki-NLP/europarl). Following the name of the dataset, we can provide a specific subset; in this case we can indicate `en-fr` to get the English-French subset. We can then use the `split` argument to specify which split we want to load; let's load the `train` split.

Finally, sometimes the datasets we will be working with might be too big to download. The `load_dataset` function provides the `streaming` argument which, when set to `True`, will load the dataset as an iterable (an `IterableDataset`) and only fetch the instances as you iterate over it.

With all this knowledge, let's get our dataset:

In [4]:
dataset = load_dataset(
    'Helsinki-NLP/europarl',
    'en-fr',
    split='train',
    streaming=True
)
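To illustrate what "only fetch the instances as you iterate" means, here is a minimal plain-Python sketch. It does not use the `datasets` library: the `stream_rows` helper, the `fetched` log, and the second row are hypothetical and only mimic the lazy behaviour of a streamed dataset.

```python
def stream_rows(rows, log):
    """Yield rows one at a time, recording each fetch in `log`."""
    for row in rows:
        log.append(row["translation"]["en"])  # stands in for a network fetch
        yield row


# Two rows shaped like the europarl samples (the second one is made up).
rows = [
    {"translation": {"en": "Resumption of the session", "fr": "Reprise de la session"}},
    {"translation": {"en": "Thank you.", "fr": "Merci."}},
]

fetched = []
stream = stream_rows(rows, fetched)  # creating the stream fetches nothing
fetched_before = len(fetched)

first = next(stream)                 # asking for one row fetches exactly one row
fetched_after = len(fetched)

print(fetched_before, fetched_after)
```

The second row is never touched unless the loop keeps going, which is why streaming avoids downloading a large dataset up front.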

### Exploring a dataset

We can now explore one sample of the dataset.

We can use `next(iter(dataset))` (equivalent to `dataset.__iter__().__next__()`) to obtain the first element of the dataset. Simple indexing like `dataset[0]` does not work on an `IterableDataset`, which is what we get when we set `streaming=True`.

We also have a small `nested_print` function that can help us print every value stored in the sample, even when there are nested dictionaries.

In [5]:
sample = next(iter(dataset))

def nested_print(key, element, level=0):
    if isinstance(element, dict):
        print(f'{"│ "*(level)}├─{key}:')
        for k, v in element.items():
            nested_print(k, v, level+1)
    else:
        print(f'{"│ "*(level)}├─{key}: {element}')

nested_print('sample', sample)

├─sample:
│ ├─translation:
│ │ ├─en: Resumption of the session
│ │ ├─fr: Reprise de la session
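As a side note on reading from an iterable, the sketch below uses a plain Python generator (the hypothetical `items` function) as a stand-in for a streamed dataset. It shows that indexing fails on a generator, and how to peek at the first element or the first few elements instead.

```python
from itertools import islice


def items():
    """A tiny generator standing in for a streamed dataset."""
    yield from ("first", "second", "third")


try:
    items()[0]  # generators do not support indexing
    index_error = None
except TypeError as err:
    index_error = type(err).__name__

first = next(iter(items()))           # same as items().__iter__().__next__()
first_two = list(islice(items(), 2))  # peek at several elements without a loop

print(index_error, first, first_two)
```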


### Preprocessing a dataset

We can use the `.map(function)` method of the `dataset` object to apply some preprocessing to the data. By default, the function you provide to `.map()` processes one sample at a time and returns a dictionary with the columns to be overwritten or added to the sample. You can then use the `remove_columns` argument to drop columns you won't need. To find more information about the `.map()` method, visit [this link](https://huggingface.co/docs/datasets/v3.0.1/en/package_reference/main_classes#datasets.Dataset.map).

In our case, let's turn the existing `translation` column into two columns: `source` for the English text and `target` for the French text.

In [6]:
def preprocess_europarl(sample):
    # Flatten the nested `translation` dictionary into two top-level columns.
    return {
        'source': sample['translation']['en'],
        'target': sample['translation']['fr']
    }

dataset = dataset.map(
    preprocess_europarl,
    remove_columns=[
        'translation'
    ]
)
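To make the per-sample behaviour concrete, here is a minimal sketch in plain Python. The `apply_map` helper is hypothetical and is not how the library is implemented; it only mimics the description above: the returned dictionary is merged into the sample, and the columns listed in `remove_columns` are then dropped.

```python
def apply_map(sample, function, remove_columns=()):
    """Mimic the per-sample behaviour of `.map()` (illustration only)."""
    updated = {**sample, **function(sample)}  # returned keys are added or overwritten
    return {key: value for key, value in updated.items() if key not in remove_columns}


def preprocess_europarl(sample):
    return {
        "source": sample["translation"]["en"],
        "target": sample["translation"]["fr"],
    }


sample = {"translation": {"en": "Resumption of the session", "fr": "Reprise de la session"}}
result = apply_map(sample, preprocess_europarl, remove_columns=["translation"])
print(result)
```

Without `remove_columns`, the original `translation` column would still be present next to the two new ones.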

### Filtering the dataset

The `dataset` object also includes a `.filter(function)` method. Like `.map()`, by default the function provided to this method evaluates one instance at a time. If the function returns `True`, the instance is kept; if it returns `False`, it is removed. You can find more information about the `.filter()` method on [this link](https://huggingface.co/docs/datasets/v3.0.1/en/package_reference/main_classes#datasets.Dataset.filter).

In our case, let's filter out all the examples where the `source` or the `target` has fewer than 20 characters or more than 40 characters.

In [7]:
def filter_europarl(sample):
    # Keep samples whose source and target are both 20 to 40 characters long.
    return all(20 <= len(sample[key]) <= 40 for key in ('source', 'target'))

dataset = dataset.filter(filter_europarl)
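The predicate can be checked on a couple of samples without touching the dataset. This is a small sketch: the `within_length` helper and the second sample are hypothetical, and the list comprehension only stands in for what `.filter()` does one instance at a time.

```python
def within_length(text, low=20, high=40):
    """True when the text has between `low` and `high` characters, inclusive."""
    return low <= len(text) <= high


def filter_europarl(sample):
    return within_length(sample["source"]) and within_length(sample["target"])


samples = [
    {"source": "Resumption of the session", "target": "Reprise de la session"},
    {"source": "Thank you.", "target": "Merci."},  # made-up pair, too short
]

kept = [sample for sample in samples if filter_europarl(sample)]
print(kept)
```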

We can make use of the `.take(n)` method of the `dataset` object to select only a small subset of the available data, which is useful when the dataset is too big and we do not want to, or cannot, process all of it.

In our case, let's only keep the first 16 samples in the dataset:

In [8]:
dataset = dataset.take(16)
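The following plain-Python sketch illustrates why chaining map, filter, and take on a streamed dataset is cheap. It does not use the `datasets` library: the `source` generator and its synthetic rows are hypothetical, and `itertools.islice` plays the role of `.take()`. Only as many rows are read as are needed to fill the requested subset.

```python
from itertools import islice

consumed = []  # records how many rows were actually read from the stream


def source():
    """Hypothetical stream of sentences of growing length."""
    for n in range(1, 1000):
        consumed.append(n)
        yield {"source": "x" * n}


mapped = ({"source": s["source"], "length": len(s["source"])} for s in source())
filtered = (s for s in mapped if 20 <= s["length"] <= 40)
taken = list(islice(filtered, 5))  # analogous to .take(5)

print([s["length"] for s in taken], len(consumed))
```

Nothing is read until `list(...)` starts pulling items, and reading stops as soon as five rows have passed the filter.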

In [9]:
def nested_print(key, element, level=0):
    if isinstance(element, dict):
        print(f'{"│ "*(level)}├─{key}:')
        for k, v in element.items():
            nested_print(k, v, level+1)
    else:
        print(f'{"│ "*(level)}├─{key}: {element}')

for sample in dataset:
    nested_print('sample', sample)
    print()

├─sample:
│ ├─source: Resumption of the session
│ ├─target: Reprise de la session

├─sample:
│ ├─source: It is the case of Alexander Nikitin.
│ ├─target: Il s'agit du cas d'Alexandre Nikitin.

├─sample:
│ ├─source: We do not know what is happening.
│ ├─target: Nous ne savons pas ce qui se passe.

├─sample:
│ ├─source: Relating to Wednesday:
│ ├─target: En ce qui concerne le mercredi :

├─sample:
│ ├─source: (Applause from the PSE Group)
│ ├─target: (Applaudissements du groupe PSE)

├─sample:
│ ├─source: Thank you, Mr Poettering.
│ ├─target: Merci Monsieur Poettering.

├─sample:
│ ├─source: It is not a lot to ask.
│ ├─target: Ce n' est pas demander beaucoup.

├─sample:
│ ├─source: There is no room for amendments.
│ ├─target: Les modifications n'ont pas lieu d'être.

├─sample:
│ ├─source: That did not happen.
│ ├─target: Mais ma demande n'a pas été satisfaite.

├─sample:
│ ├─source: I would urge you to endorse this.
│ ├─target: Je vous demande votre approbation.

├─sample:
│ ├─source: Th

## 2. Finding, Loading, and Using Models

### The Hugging Face Models Repository

Just like in the previous section, we will start by visiting the [`Hugging Face Homepage`](https://huggingface.co).

This time, we will focus on its Model repository. To do so, we can click on the [`Models`](https://huggingface.co/models) button on the navigation bar, which should redirect us to the repository.

Once again, the left side of the screen offers a series of categories (Tasks, Libraries, Datasets, Languages, Licences, Other). By clicking on these categories, you will be presented with different options to facilitate your search. At the top of the right side you will also find a search bar as well as a dropdown to choose how to sort your results.

For now, let's get an English-French translation model: [`Helsinki-NLP/opus-mt-en-fr`](https://huggingface.co/Helsinki-NLP/opus-mt-en-fr).

On this view we have some information about the model, like the supported languages, architecture, preprocessing, and benchmark results.

### The `transformers` Python library

Hugging Face's main library, [`transformers`](https://huggingface.co/docs/transformers/en/index), provides all the infrastructure to load, train, and use a variety of models.

First, we need to install the library in case we have not installed it already:

In [10]:
! pip install transformers --quiet


Once the library has been installed we can import the `pipeline` function that allows us to download and use our model in Python. We will also import the `torch` library which will help us detect the most powerful device available to run our model.

In [11]:
from transformers import pipeline
import torch

### Loading a model

After importing the `pipeline` function, we can get our model. If you want to have an in-depth look at the function, you can visit [this link](https://huggingface.co/docs/transformers/v4.45.2/en/main_classes/pipelines#transformers.pipeline).

For now, let's just start with the most important argument of the function: `model`, which takes the identifier of the model you want to load; in our case, that would be [`Helsinki-NLP/opus-mt-en-fr`](https://huggingface.co/Helsinki-NLP/opus-mt-en-fr). Given the nature of the model, the pipeline will be instantiated as a `TranslationPipeline`.

We can use the `device` argument to specify where we want to load the model: on a GPU if we have one available, or on the CPU if that is all we have.

With all this knowledge, let's get our model:

In [12]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
translator = pipeline(model='Helsinki-NLP/opus-mt-en-fr', device=device)
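The device choice can also be wrapped in a small helper. This is a sketch under assumptions: the `pick_device` name is hypothetical, the check for Apple's `mps` backend is an addition not covered in this lesson, and the `try`/`except` around the import is only there so the snippet still runs where PyTorch is not installed.

```python
try:
    import torch
except ImportError:  # keeps the sketch runnable without PyTorch
    torch = None


def pick_device():
    """Return 'cuda', 'mps', or 'cpu', preferring an accelerator when one is visible."""
    if torch is None:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch versions
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```

The returned string is intended to be passed as the `device` argument of `pipeline`; check the documentation of your installed `transformers` version to confirm that strings are accepted there.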



### Applying a model

Now, all we need to do is pass some text to the pipeline and it will output the translation.

In [13]:
translator('Hello, how are you today?')

[{'translation_text': "Bonjour, comment allez-vous aujourd'hui ?"}]

As you can see, the output of the pipeline is a list that contains one dictionary for each input prompt. Each dictionary has a single key (`translation_text`) whose value is the model's translation.
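A minimal sketch of pulling the translated strings out of that structure. The `outputs` list is copied from the result shown above instead of being produced by a real pipeline call, so the snippet runs without downloading the model.

```python
# Literal copy of the pipeline output shown above.
outputs = [{'translation_text': "Bonjour, comment allez-vous aujourd'hui ?"}]

# One dictionary per input prompt, so one string per prompt after extraction.
translations = [item['translation_text'] for item in outputs]
print(translations[0])
```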

### Batching

We can even use batching to make the most of our available computation. However, that requires some additional steps.

First, we need to convert our `IterableDataset` into a normal `Dataset`, since the pipeline can only do batching on the latter. Because we already specified that we are only taking 16 instances from the dataset, we won't have to deal with long download times or memory issues.

In [14]:
from datasets import Dataset
dataset = Dataset.from_generator(lambda: iter(dataset), features=dataset.features)
dataset

Dataset({
    features: ['source', 'target'],
    num_rows: 16
})

Then we need to extract only the `source` column from our dataset. We can do that with the `KeyDataset` class from `transformers.pipelines.pt_utils`.


In [16]:
from transformers.pipelines.pt_utils import KeyDataset

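# batch_size sets how many inputs go through the model per forward pass (8 is an arbitrary choice).
for translation_batch in translator(KeyDataset(dataset, 'source'), batch_size=8):
    print(translation_batch)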

[{'translation_text': "Je suis d'accord avec votre analyse."}]
[{'translation_text': 'Merci beaucoup, Monsieur le Commissaire.'}]
[{'translation_text': '(La séance est levée à 20 h 25)'}]
[{'translation_text': 'Y a-t-il des commentaires?'}]
[{'translation_text': 'Merci beaucoup, Monsieur Cox.'}]
[{'translation_text': 'Je comprends ce que vous dites.'}]
[{'translation_text': 'Nous en avons pris note.'}]
[{'translation_text': "M. Wynn, c'est logique."}]
[{'translation_text': 'Madame Ahern, nous en avons pris note.'}]
[{'translation_text': '(Le procès-verbal est adopté)'}]
[{'translation_text': 'Elle devrait continuer sur cette voie.'}]
[{'translation_text': 'Tout cela est pour le bien du consommateur.'}]
[{'translation_text': "L'un d'eux est la subsidiarité."}]
[{'translation_text': 'Merci beaucoup, Monsieur Radwan.'}]
[{'translation_text': "C'est inacceptable pour moi."}]
[{'translation_text': "C'est inacceptable à mon avis."}]
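Conceptually, batching means grouping consecutive inputs so that several of them are processed together. The sketch below shows only that grouping step in plain Python; the `batched` helper, the placeholder sentences, and the batch size of 8 are all hypothetical, and this is not the pipeline's internal implementation.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to `batch_size` consecutive items."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, batch_size)):
        yield batch


sources = [f"sentence {i}" for i in range(16)]  # placeholder inputs
batches = list(batched(sources, 8))

print([len(batch) for batch in batches])
```

When the number of inputs is not a multiple of the batch size, the last batch is simply shorter.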


In [17]:

all_translations = []
for translation_batch in translator(KeyDataset(dataset, 'source'), batch_size=8):
    all_translations += [translation['translation_text'] for translation in translation_batch]

for source, target, translation in zip(dataset['source'], dataset['target'], all_translations):
    print('source:     ', source)
    print('target:     ', target)
    print('translation:', translation)
    print()

source:      I agree with your analysis.
target:      Je suis d'accord avec votre analyse.
translation: Je suis d'accord avec votre analyse.

source:      Thank you very much, Commissioner.
target:      Merci beaucoup, Monsieur le Commissaire.
translation: Merci beaucoup, Monsieur le Commissaire.

source:      (The sitting was closed at 8.25 p.m.)
target:      (La séance est levée à 20h25)
translation: (La séance est levée à 20 h 25)

source:      Are there any comments?
target:      Y a-t-il des observations ?
translation: Y a-t-il des commentaires?

source:      Thank you very much, Mr Cox.
target:      Merci beaucoup, Monsieur Cox.
translation: Merci beaucoup, Monsieur Cox.

source:      I understand what you are saying.
target:      Je vois ce que vous voulez dire.
translation: Je comprends ce que vous dites.

source:      We have taken note of this.
target:      Nous en avons pris note.
translation: Nous en avons pris note.

source:      Mr Wynn, that makes sense.
target:      C'e

## 3. Exercise

Explore the Hugging Face Dataset and Model repositories.

- Find a dataset that can be used for a *text generation task* in your *native language*.
- Load and preprocess *8 instances* of the dataset.
- Find a *small model* that can be used to solve the task.
- Run your data instances through the model.

### Group A

- Oyetunji ABIOYE
- Mehsen AZIZI
- Mohammad AL TAKACH

In [42]:
# Your Code Here
from datasets import load_dataset

dataset = load_dataset(
    'ImruQays/Rasaif-Classical-Arabic-English-Parallel-texts',
    split='train',
    streaming=True
)

In [43]:
sample = next(iter(dataset))

def nested_print(key, element, level=0):
    if isinstance(element, dict):
        print(f'{"│ "*(level)}├─{key}:')
        for k, v in element.items():
            nested_print(k, v, level+1)
    else:
        print(f'{"│ "*(level)}├─{key}: {element}')

nested_print('sample', sample)

├─sample:
│ ├─ar: وبعد، فلما كان السلطان الأعظم الملك الناصر، العالم المجاهد المرابط المتاغر، المؤيد المظفر المنصور، زين الدنيا والدين، سلطان الإسلام والمسلمين، محيى العدل فى العالمين، وارث ملك ملوك العرب والعجم والترك، ظل الله فى أرضه، القائم بسنته وفرضه
│ ├─en: To proceed: Since the great Sultan, the King, the Victor, the Sage, the Just, the Struggler, the Perseverer, the Trail-blazer, the God-supported, the Conquering, the Victorious, the Ornament of the World and of Religion, the Sultan of Islam and of the Muslims, the Rejuvenator of Justice in the Worlds, the Heir of the kingdom of the Kings of the Arabs and the Persians and the Turks, Shadow of God in His land, the Upholder of God’s sunnah and of His Ordinances.


In [48]:
def preprocess_en_ar(sample):
    return {
        'source':sample['en'],
        'target':sample['ar'],
    }

dataset = dataset.map(
    preprocess_en_ar,
    remove_columns=[
        'en',
        'ar'
    ]
)


In [49]:
# Filter out samples that are too short or too long.
# We increase the lower bound to 30 because Arabic is a more verbose language.
def filter_en_ar(sample):
    return all(30 <= len(sample[key]) <= 60 for key in ('source', 'target'))

dataset = dataset.filter(filter_en_ar)

In [50]:
dataset = dataset.take(16)

In [51]:
for sample in dataset:
    nested_print('sample', sample)
    print()

├─sample:
│ ├─source: The last man to face me was Hamdawayh the Heavyweight.
│ ├─target: كان آخر من صادفني حمدويه أبو الأرطال.

├─sample:
│ ├─source: Pannum is bread donated charitably to prisoners and beggars.
│ ├─target: والزكوري: هو خبز الصدقة، كان على سجين أو على سائل.

├─sample:
│ ├─source: But then I see them dip it in the mustard.
│ ├─target: ثم لا ألبث أن أراهم يصنعون مثل ذلك بالخردل.

├─sample:
│ ├─source: “Boy, that chicken was tough. Bring me one that’s tender!”.
│ ├─target: ثم قال: يا غلام جئني بواحدة رخصة، فإن هذه كانت عضلة جدا.

├─sample:
│ ├─source: This approach of‘Ali’s is a major setback.
│ ├─target: وهذا المذهب من عليّ استضعاف شديد.

├─sample:
│ ├─source: “They have only one dish apiece, while you enjoy a variety”.
│ ├─target: إنما لكل بيت منهم لون واحد، وعندكم ألوان.

├─sample:
│ ├─source: I praise Him as befits His honor and sublime glory.
│ ├─target: أحمده حمداً كما ينبغي لكرم وجهه وعِز جلاله.

├─sample:
│ ├─source: Mention has been already made of Masila.
│ ├─tar

In [57]:
! pip install peft --quiet


In [None]:
import torch
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = "ahmedheakl/arazn-gemma1.1-2B-arabic"
peft_config = PeftConfig.from_pretrained(peft_model_id)
base_model_name = peft_config.base_model_name_or_path
base_model = AutoModelForCausalLM.from_pretrained(base_model_name)
model = PeftModel.from_pretrained(base_model, peft_model_id)
model = model.to("cuda") if torch.cuda.is_available() else model.to("cpu")
tokenizer = AutoTokenizer.from_pretrained(peft_model_id)

In [None]:
raw_prompt = """<bos><start_of_turn>user
Translate the following English text to Arabic only.
{source}<end_of_turn>
<start_of_turn>model
"""
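def inference(prompt: str) -> str:
    """Translate one English sentence to Arabic using the chat-style prompt above."""
    prompt = raw_prompt.format(source=prompt)
    # Move the input tensors to the model's device; otherwise generation fails on GPU.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        **inputs,
        use_cache=True,
        num_return_sequences=1,
        max_new_tokens=100,
        do_sample=True,  # sampling makes the output vary between runs
        num_beams=1,
        temperature=0.7,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )
    outputs = tokenizer.batch_decode(generated_ids)[0]
    # Keep only the model's turn, dropping the prompt and the end-of-turn marker.
    return outputs.split("<start_of_turn>model\n")[-1].split("<end_of_turn>")[0]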
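The prompt formatting and the reply extraction can be tried without loading the model. In this sketch the `extract_reply` helper, the example sentence, and the `decoded` string are hypothetical: `decoded` is the formatted prompt followed by a made-up placeholder reply, standing in for what the tokenizer would decode.

```python
raw_prompt = """<bos><start_of_turn>user
Translate the following English text to Arabic only.
{source}<end_of_turn>
<start_of_turn>model
"""


def extract_reply(decoded):
    """Keep the text between the last model-turn marker and the next end-of-turn marker."""
    return decoded.split("<start_of_turn>model\n")[-1].split("<end_of_turn>")[0]


prompt = raw_prompt.format(source="Good morning.")

# Made-up decoded output: the prompt followed by a placeholder reply.
decoded = prompt + "REPLY<end_of_turn>"
reply = extract_reply(decoded)
print(repr(reply))
```

Splitting on the model-turn marker first matters because the prompt itself contains an `<end_of_turn>` marker at the end of the user turn.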

In [72]:
print(inference("The last man to face me was Hamdawayh the Heavyweight"))


آخر الناس ما اتحدت معي هو حميد الويزي


In [73]:
from datasets import Dataset
dataset = Dataset.from_generator(lambda: iter(dataset), features=dataset.features)
dataset

Dataset({
    features: ['source', 'target'],
    num_rows: 16
})

In [79]:
source_list = list(dataset['source'])

all_translations = []
for item in source_list:
    translation = inference(item)
    all_translations.append(translation)

In [80]:
for source, target, translation in zip(dataset['source'], dataset['target'], all_translations):
    print('source:     ', source)
    print('target:     ', target)
    print('translation:', translation)
    print()

source:      This they continued to do till he left the Sacred Territory.
target:      ولا يزالون كذلك حتى يخرج من الحرم.
translation:  دههم تواصلوا حتى أراد عيش في أرض المقدسة.

source:      Mention has been already made of his ancestor Abu Burda.
target:      وقد تقدم ذكر جده أبي بردة في أول حرف العين.
translation: هل ذكر هو أجدده أبو بُردَ.

source:      Of Okbara I have already spoken in the life of Abi al-Bakaa.
target:      وعكبرا قد تقدم القول عليها في ترجمة الشيخ أبي البقاء.
translation: على أكبرا أنا مكتضرها في حياة أبي الجاسر.

source:      He then retired to his house and died some days afterwards.
target:      وخرج من مجلسه وأتى منزله وأقام أياما ومات.
translation: أين عملت؟ وأعدت نفسه في منزله ودفنشوف كمان؟

source:      A fine verse from one of his qasidas is the following.
target:      وله بيت بديع من جملة قصيدة وهو:
translation: مستقبل في عالم ديوان enslavement معارضة حقيقة؟

source:      Ibn al-Mutazz has the following lines on a similar subject.
target:      ولابن الم