<a href="https://colab.research.google.com/github/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/dl_in_hlt_2025_exercise_2_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Example solution to exercise task 2

Task description:

> ### Load a dataset, and tokenize and vectorize the data
>
> In this exercise, the task is to load a text classification dataset using the `datasets` Python library, and tokenize and vectorize the loaded data using the tokenizer created in the exercise task 1. This exercise builds towards a full model training notebook.
>
> 1) Load the `imdb` movie review dataset using the `datasets` Python library. Here is a [helper notebook](https://github.com/TurkuNLP/intro-to-nlp/blob/master/datasets_introduction.ipynb) from the Introduction to Language Technology course in case you are not familiar with `datasets` or need a reminder.

We'll follow the initial steps in the referenced notebook to install and import the `datasets` library.

In [None]:
!pip install --quiet datasets

In [None]:
import datasets

Following the notebook further, we load the named dataset with the `load_dataset` function ([documentation](https://huggingface.co/docs/datasets/package_reference/loading_methods#datasets.load_dataset)).

In [None]:
DATASET_NAME = 'imdb'

dataset = datasets.load_dataset(DATASET_NAME)

Let's have a quick look at the contents.

In [None]:
dataset

The `dataset` object we loaded is an instance of the `DatasetDict` class, which is a dictionary-like class that maps strings such as `train` to `Dataset` objects. So, we can reference one of the `Dataset` objects in the `DatasetDict` like this:

In [None]:
dataset['train']

Each item in the dataset contains a `text` and a `label`, as we would expect for a dataset for text classification. We can look at an individual example from the dataset by indexing it like a list:

In [None]:
dataset['train'][0]

Note that for the `imdb` dataset we have the conventional `train` and `test` parts and an `unsupervised` part that's somewhat specific to this dataset.  If you're interested in the intended role of this data, you can read about it in the [original paper introducing the IMDB dataset](https://aclanthology.org/P11-1015.pdf).

Task description:

> 2) Tokenize and vectorize the dataset using a tokenizer created in the exercise task 1. The tokenizer can be either monolingual English or multilingual (e.g. `bert-base-cased` or `bert-base-multilingual-cased`). The outcome of a tokenized and vectorized example should look something like this (some tokenizers do not produce token_type_ids):
>
>```
>{'attention_mask': [1, 1, 1, ... , 1],
> 'input_ids': [101, 146, 12765, ... , 102],
> 'token_type_ids': [0, 0, 0, ..., 0]}
>```
>
> Hint: To tokenize and vectorize the whole dataset, write a function which receives one example, and returns its tokenized+vectorized version. Apply this function to each example in the dataset using `dataset.map()`.

We'll first grab a tokenizer as in exercise task 1:

In [None]:
!pip install --quiet transformers

In [None]:
from transformers import AutoTokenizer

TOKENIZER_NAME = 'bert-base-cased'

tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)

As we have already seen in the lecture, we can get tokenized and vectorized output of the type that this task requires simply by invoking the tokenizer directly:

In [None]:
tokenizer('Hello world!')
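To make the three fields in that kind of output concrete, here is a deliberately simplified sketch of what "tokenize and vectorize" means. It is not how the BERT tokenizer works internally: the whitespace splitting, the `toy_vocab` dictionary, the helper `toy_encode` and the word ids are all invented for illustration. Only the ids 101 and 102 for the start and end markers are chosen to mirror the example output in the task description.

```python
# Invented miniature vocabulary mapping tokens to integer ids.
toy_vocab = {'[UNK]': 100, '[CLS]': 101, '[SEP]': 102,
             'Hello': 7, 'world': 8, '!': 9}

def toy_encode(text):
    # Tokenize: split off "!" and then split on whitespace (a crude stand-in
    # for real subword tokenization).
    words = text.replace('!', ' !').split()
    # Add the special start and end markers around the sequence.
    tokens = ['[CLS]'] + words + ['[SEP]']
    # Vectorize: look up each token's id, falling back to the unknown token.
    input_ids = [toy_vocab.get(token, toy_vocab['[UNK]']) for token in tokens]
    return {
        'input_ids': input_ids,
        # All zeros here because there is only one input segment.
        'token_type_ids': [0] * len(input_ids),
        # All ones here because there is no padding to mask out.
        'attention_mask': [1] * len(input_ids),
    }

print(toy_encode('Hello world!'))
```

The real tokenizer differs mainly in how it splits text into tokens and in the size of its vocabulary; the overall shape of the result is the same.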

The task asks us to process the whole dataset using the [`Dataset.map()`](https://huggingface.co/docs/datasets/en/process#map) function. We could do this separately for each of the `Dataset` objects contained in the top-level `DatasetDict` object (e.g. in a loop), but for convenience `DatasetDict` also implements functions such as `map()` ([documentation](https://huggingface.co/docs/datasets/en/process#process-multiple-splits)) that do exactly this, calling the corresponding function on each of its contained `Dataset` objects. We can thus ignore the difference between `Dataset` and `DatasetDict` here and call `map()` directly on the object we have loaded.

As suggested by the hint, let's write a function that invokes the tokenizer. We'll first try to make this work with just text strings.

In [None]:
def tokenize_string(text):
  return tokenizer(text)

This works as we would expect for strings:

In [None]:
tokenize_string('Hello world!')

Now, let's try to use the `map` function ([documentation](https://huggingface.co/docs/datasets/en/process#map)) referenced in the hint to apply that to the dataset.

In [None]:
dataset.map(tokenize_string)

That didn't work because the tokenizer expected a string (or a list of strings, or a list of lists of strings). What do we have instead? (The `sys.exit()` call below just stops processing after the first example has been printed.)

In [None]:
import sys

def debug_tokenize(example):
  print(type(example), example)
  sys.exit()

dataset.map(debug_tokenize)

Looks like we have a dictionary with `text` and `label`, which makes sense -- this is what each example in the dataset contains. It's the text we want to tokenize, so let's try that.

In [None]:
def tokenize(example):
  return tokenizer(example['text'])

tokenized_dataset = dataset.map(tokenize)

tokenized_dataset

That appears to have worked as intended, with `DatasetDict.map()` applying our tokenization function to all of the included `Dataset` objects. Note that the new `DatasetDict` returned by the function contains not only the fields returned by the tokenizer (`input_ids`, `token_type_ids`, and `attention_mask`) but also the original ones (`text` and `label`).

Note that the `map` function did not modify the original `DatasetDict` or the `Dataset` objects that it contains, so these are different:

In [None]:
print('Original:', dataset)
print('Tokenized:', tokenized_dataset)