
# Example solution for exercise task 1

## Task 1 part 1 description

Using the load_dataset function of the datasets library, load each of the following datasets in turn:

* [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb)
* [eriktks/conll2003](https://huggingface.co/datasets/eriktks/conll2003)
* [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)

For each of the datasets, report the following information:

* What NLP task is the dataset intended for (e.g. syntactic analysis, toxicity detection, etc.)? (You may need to refer to the documentation of the dataset for this.)
* What parts is the dataset split into (e.g. train, test) and how many examples does each contain?
* What features (e.g. text, label) does the dataset have? (Try to understand how these relate to the NLP task the dataset is intended for.)
* What is the first item in the training set of the dataset?

## Task 1 part 1 example solution

To answer the question "What NLP task is the dataset intended for?", we can refer to the linked descriptions:

* [stanfordnlp/imdb](https://huggingface.co/datasets/stanfordnlp/imdb): "*This is a dataset for binary sentiment classification*". This is the IMDB dataset of positive and negative movie reviews that is hopefully already familiar to you from previous courses.
* [eriktks/conll2003](https://huggingface.co/datasets/eriktks/conll2003): "*The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities*". This is the well-known CoNLL'03 NER dataset that was discussed in the lecture.
* [openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k): "*GSM8K (Grade School Math 8K) is a dataset of 8.5K high quality linguistically diverse grade school math word problems. [...] This dataset is generally used to test logic and math in language modelling. It has been used for many benchmarks, including the LLM Leaderboard.*" This is a dataset of math problems that is commonly used to evaluate LLMs.

The remaining three questions (split parts and their sizes, features, and first item of each) can be addressed e.g. as follows.

First, install the `datasets` Python package:

In [None]:
!pip install --quiet datasets

Then, load each of the three datasets. The arguments differ slightly: `conll2003` requires `trust_remote_code=True` so that the loader can execute code shipped with the dataset (something you should generally only allow when you trust the source, as we do here), and `gsm8k` requires a configuration name, for which we'll just use `main`. (We're loading these into a dict for convenience, but this isn't a required part of the solution.)

In [None]:
from datasets import load_dataset

dataset_dict = {
  'imdb': load_dataset('stanfordnlp/imdb'),
  'conll2003': load_dataset('eriktks/conll2003', trust_remote_code=True),
  'gsm8k': load_dataset('openai/gsm8k', 'main'),
}

We can then write a simple function to print out the required information and invoke it with each of the datasets in turn. (You can find out how to get this information either by inspecting the dataset object or by consulting the documentation.)

In [None]:
def print_dataset_info(dataset):
  """Print split names, sizes, features, and the first training item."""
  # The loaded object maps split names (e.g. "train") to the split data
  for part in dataset.keys():
    print(f'part "{part}":')
    print(f'\tsize: {dataset[part].num_rows}')
    print(f'\tfeatures: {list(dataset[part].features.keys())}')
  print(f'first item in training set:\n{dataset["train"][0]}')

for n, d in dataset_dict.items():
  print(f'{n}:')
  print_dataset_info(d)
  print('-'*78)

imdb:
part "train":
	size: 25000
	features: ['text', 'label']
part "test":
	size: 25000
	features: ['text', 'label']
part "unsupervised":
	size: 50000
	features: ['text', 'label']
first item in training set:
{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with

Some things you might note:
* All three datasets have `train` and `test` parts, but CoNLL'03 also has `validation` (for e.g. hyperparameter optimization) and IMDB also has `unsupervised` (for e.g. semi-supervised learning).
* IMDB and GSM8K contain plain text data (`text` and `question`/`answer`), but CoNLL'03 contains tokenized text (`tokens`) as it is a token-level labeling dataset.
* The features for IMDB and GSM8K are simply the texts and their labels in the former and the questions and their answers in the latter. In CoNLL'03, the `ner_tags` are the named entity tags that are the target of the task, while `pos_tags` and `chunk_tags` are provided as potential supporting features for approaches using hand-engineered features.
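
The values in `ner_tags` are integers that index into a list of label names. The following is a minimal sketch of that decoding in plain Python. Both the label list and the sample item are written out by hand here as assumptions for illustration; in the notebook, the label names would normally be read from the dataset's feature definitions rather than hard-coded.

```python
# Assumed label inventory for illustration: IOB-style tags over the four
# entity types named in the dataset description (PER, ORG, LOC, MISC).
NER_LABELS = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG',
              'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# Hypothetical item shaped like a CoNLL'03 example (tokens + integer tags).
item = {
  'tokens': ['EU', 'rejects', 'German', 'call'],
  'ner_tags': [3, 0, 7, 0],
}

def decode_tags(item, labels):
  """Pair each token with the name of its integer tag."""
  return [(tok, labels[tag]) for tok, tag in zip(item['tokens'], item['ner_tags'])]

decoded = decode_tags(item, NER_LABELS)
for token, label in decoded:
  print(f'{token}\t{label}')
```

Seeing tokens next to their label names makes the token-level nature of the task easier to inspect than the raw integer lists.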

## Task 1 part 2 description

You can find data collected from the Yle news RSS feed here: http://dl.turkunlp.org/TKO_8964_2023/

Download either the Finnish or English data (`news-fi-2021.jsonl` or `news-en-2021.jsonl`) using `wget` and create a dataset from the JSONL data (see https://huggingface.co/docs/datasets/loading#json). Answer the following questions:

* What NLP tasks could the dataset be used for?
* What features does the dataset have?
* How many space-separated words do the texts of the dataset contain in total?

## Task 1 part 2 example solution

We'll first download the Finnish data using `wget` as instructed:

In [None]:
!wget http://dl.turkunlp.org/TKO_8964_2023/news-fi-2021.jsonl

--2025-01-21 07:31:35--  http://dl.turkunlp.org/TKO_8964_2023/news-fi-2021.jsonl
Resolving dl.turkunlp.org (dl.turkunlp.org)... 195.148.30.23
Connecting to dl.turkunlp.org (dl.turkunlp.org)|195.148.30.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 36139303 (34M) [application/octet-stream]
Saving to: ‘news-fi-2021.jsonl’


2025-01-21 07:31:37 (16.3 MB/s) - ‘news-fi-2021.jsonl’ saved [36139303/36139303]



Let's check the first line:

In [None]:
!head -n 1 news-fi-2021.jsonl

{"summary": "Perussuomalaiset valitsee viikonloppuna uuden puheenjohtajan väistyvän Jussi Halla-ahon seuraajaksi. Vaaliin on ilmoittautunut kaikkiaan viisi ehdokasta. Ennakkosuosikkeja heistä ovat kansanedustajat Purra ja Puisto. Miten he eroavat toisistaan?", "tags": ["A-studio", "Euroskeptiset poliittiset puolueet", "Jussi Halla-aho", "Perussuomalaiset", "Populistiset puolueet", "Riikka Purra", "Sakari Puisto", "Suomen poliittiset puolueet", "Suomen politiikka", "politiikka", "puolueet", "puoluejohtajat", "suomalaiset poliitikot"], "text": "Perussuomalaisten puoluekokous kokoontuu viikonloppuna Seinäjoella. Suurin huomio keskittyy puheenjohtajan valintaan. \n Nykyinen puheenjohtaja  Jussi Halla-aho  valittiin perussuomalaisten johtoon vuonna 2017. Kesäkuussa Halla-aho ilmoitti, ettei hae elokuun puoluekokouksessa jatkokautta. \n Ehdolla uudeksi puheenjohtajaksi on viisi nimeä:  Kristiina Ilmarinen ,  Sakari Puisto ,  Riikka Purra ,  Ossi Tiihonen  ja  Urpo Vähäranta . \n Näistä kansa

Create a dataset from the JSONL file, following the `load_dataset` documentation:

In [None]:
d = load_dataset('json', data_files='news-fi-2021.jsonl')
print(d)

DatasetDict({
    train: Dataset({
        features: ['summary', 'tags', 'text', 'timestamp', 'title', 'url'],
        num_rows: 7482
    })
})
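
JSONL simply means one JSON object per line, which is why `head -n 1` showed a complete record. As a rough, standard-library-only sketch of what reading such a file involves (the file name `sample.jsonl` and its two records are made up for illustration, and this is not a description of how `load_dataset` works internally):

```python
import json
import os
import tempfile

# Two made-up records using a subset of the field names seen in the data.
records = [
  {'title': 'First item', 'text': 'Some text here.', 'tags': ['a', 'b']},
  {'title': 'Second item', 'text': 'More text.', 'tags': []},
]

# Write one JSON object per line to a temporary file.
path = os.path.join(tempfile.mkdtemp(), 'sample.jsonl')
with open(path, 'w', encoding='utf-8') as f:
  for record in records:
    f.write(json.dumps(record, ensure_ascii=False) + '\n')

# Read it back: each non-empty line is parsed independently.
with open(path, encoding='utf-8') as f:
  items = [json.loads(line) for line in f if line.strip()]

print(len(items))
print(sorted(items[0].keys()))
```

Because every line is an independent JSON document, the file can be processed line by line without loading it all at once.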


`load_dataset` automatically created a "train" split containing all the data. Let's again look at the first example and the features:

In [None]:
d['train'][0]

{'summary': 'Perussuomalaiset valitsee viikonloppuna uuden puheenjohtajan väistyvän Jussi Halla-ahon seuraajaksi. Vaaliin on ilmoittautunut kaikkiaan viisi ehdokasta. Ennakkosuosikkeja heistä ovat kansanedustajat Purra ja Puisto. Miten he eroavat toisistaan?',
 'tags': ['A-studio',
  'Euroskeptiset poliittiset puolueet',
  'Jussi Halla-aho',
  'Perussuomalaiset',
  'Populistiset puolueet',
  'Riikka Purra',
  'Sakari Puisto',
  'Suomen poliittiset puolueet',
  'Suomen politiikka',
  'politiikka',
  'puolueet',
  'puoluejohtajat',
  'suomalaiset poliitikot'],
 'text': 'Perussuomalaisten puoluekokous kokoontuu viikonloppuna Seinäjoella. Suurin huomio keskittyy puheenjohtajan valintaan. \n Nykyinen puheenjohtaja  Jussi Halla-aho  valittiin perussuomalaisten johtoon vuonna 2017. Kesäkuussa Halla-aho ilmoitti, ettei hae elokuun puoluekokouksessa jatkokautta. \n Ehdolla uudeksi puheenjohtajaksi on viisi nimeä:  Kristiina Ilmarinen ,  Sakari Puisto ,  Riikka Purra ,  Ossi Tiihonen  ja  Urpo V

In [None]:
d['train'].features

{'summary': Value(dtype='string', id=None),
 'tags': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'text': Value(dtype='string', id=None),
 'timestamp': Value(dtype='timestamp[s]', id=None),
 'title': Value(dtype='string', id=None),
 'url': Value(dtype='string', id=None)}

The dataset consists of titles, summaries, and texts of news items, each associated with a timestamp, a URL, and a set of tags. This data could be used to support a variety of NLP tasks, for example summarization, tag generation, and language model training or evaluation.

Finally, let's grab the total number of space-separated words (tokens) in the texts.

In [None]:
total = 0

for item in d['train']:
  text = item['text']
  tokens = text.split()
  total += len(tokens)

print(total)

3413542
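
One detail worth noting about this count: `text.split()` with no argument splits on any run of whitespace, so the newlines and doubled spaces visible in the texts do not produce empty tokens, whereas `text.split(' ')` splits on every individual space character. A small sketch of the difference, using a made-up string modelled on the formatting seen in the data:

```python
# Made-up sample with a newline and doubled spaces, as in the news texts.
sample = 'Suurin huomio. \n Nykyinen puheenjohtaja  Jussi Halla-aho  valittiin'

whitespace_tokens = sample.split()        # splits on runs of any whitespace
single_space_tokens = sample.split(' ')   # splits on each individual space

print(len(whitespace_tokens))
print(len(single_space_tokens))
print('' in single_space_tokens)
```

The second variant yields extra pieces (empty strings and a lone newline), which would inflate the total, so the no-argument form is the more sensible reading of "space-separated words" here.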
