<a href="https://colab.research.google.com/github/TurkuNLP/intro-to-nlp/blob/master/intro_2023_exercise_3_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise task 3: text classification corpora (solution)

This notebook shows an example solution for exercise 3.


---

## Setup

Install and import the required packages.

In [1]:
!pip install --quiet datasets


In [6]:
import datasets

from collections import Counter

Adjust verbosity for the datasets library. (This only affects what shows on screen.) 

In [7]:
datasets.disable_progress_bar()
datasets.logging.set_verbosity_error()

---

## Print dataset info

The following Python function takes a dataset name as an argument, loads the dataset, and prints the dataset description, the relative sizes of its subsets, and the label distribution in the `'train'` subset.

In [9]:
def print_dataset_info(name):
  # Load the data itself and the builder, which carries the dataset metadata
  dataset = datasets.load_dataset(name)
  builder = datasets.load_dataset_builder(name)
  print(f'--- Description for {name} ---')
  print(builder.info.description)
  print(f'--- Statistics for {name} ---')

  # Share of all examples that falls into each subset (train, test, ...)
  print('Relative sizes of subsets:')
  subsets = dataset.keys()
  total = sum(dataset[s].num_rows for s in subsets)
  for s in sorted(subsets):
    ratio = dataset[s].num_rows/total
    print(f'    {s}: {ratio:.1%} ({dataset[s].num_rows}/{total})')

  # Labels are stored as integer ids; map them to names for printing
  print('Label distribution in "train" subset:')
  label_counts = Counter(dataset['train']['label'])
  label_names = dataset['train'].features['label'].names
  total = sum(label_counts.values())
  for k, v in label_counts.items():
    print(f'    {label_names[k]}: {v/total:.1%}')

  print()
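
The label-distribution step can also be tried in isolation, without downloading anything. The following is a minimal sketch on made-up data: `label_distribution`, `toy_labels`, and `toy_names` are hypothetical names introduced here for illustration, and the labels are sorted by id, whereas the function above prints them in the order they are first seen.

```python
from collections import Counter

def label_distribution(labels, label_names):
    """Return the share of each label, keyed by label name (hypothetical helper)."""
    counts = Counter(labels)
    total = sum(counts.values())
    # Sort by integer label id so the output order is stable
    return {label_names[k]: counts[k] / total for k in sorted(counts)}

# Made-up integer labels and names, standing in for dataset['train']['label']
toy_labels = [0, 1, 1, 0, 1, 2, 1, 1]
toy_names = ['negative', 'positive', 'neutral']

for name, share in label_distribution(toy_labels, toy_names).items():
    print(f'    {name}: {share:.1%}')
```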

Run the function on the datasets `'emotion'`, `'rotten_tomatoes'`, `'snli'`, `'sst2'`, `'emo'`.

In [10]:
for name in ('emotion', 'rotten_tomatoes', 'snli', 'sst2', 'emo',):
  print_dataset_info(name)

Downloading and preparing dataset emotion/split to /root/.cache/huggingface/datasets/emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd...
Dataset emotion downloaded and prepared to /root/.cache/huggingface/datasets/emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd. Subsequent calls will reuse this data.
--- Description for emotion ---
Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper.

--- Statistics for emotion ---
Relative sizes of subsets:
    test: 10.0% (2000/20000)
    train: 80.0% (16000/20000)
    validation: 10.0% (2000/20000)
Label distribution in "train" subset:
    sadness: 29.2%
    anger: 13.5%
    love: 8.2%
    surprise: 3.6%
    fear: 12.1%
    joy: 33.5%

Downloading and preparing dataset rotten_tomatoes/default to /root/.cache/huggingface/datasets/rotten_tomatoes/default/1.

> What patterns can you notice in the relative sizes of the subsets? Can you tell why this might be? 

The training set is always clearly the largest of the subsets, making up between 80% and 96.5% of the data in each dataset.

This is because once the **validation** or **test** subsets contain enough data to estimate performance reasonably accurately (e.g. a few thousand examples), there is little benefit to increasing their size, whereas increasing the size of the **training** set generally allows better classifiers to be trained on the dataset.
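
The diminishing return from larger evaluation sets can be illustrated with a rough calculation. The sketch below assumes that each test example is an independent trial, so that the uncertainty of a measured accuracy follows the binomial standard error $\sqrt{p(1-p)/n}$. The function name `accuracy_standard_error` and the example accuracy of 0.9 are assumptions chosen for illustration, not values from the exercise.

```python
import math

def accuracy_standard_error(accuracy, n_examples):
    """Binomial standard error of an accuracy measured on n_examples (hypothetical helper)."""
    return math.sqrt(accuracy * (1 - accuracy) / n_examples)

# Assumed true accuracy of 0.9; the half-width of an approximate 95% interval is 1.96 * SE
for n in (100, 1000, 2000, 10000):
    se = accuracy_standard_error(0.9, n)
    print(f'{n:>6} examples: +/- {1.96 * se:.1%}')
```

Under these assumptions the uncertainty shrinks with the square root of the subset size, so moving from 2,000 to 10,000 test examples narrows the interval only from roughly 1.3 to 0.6 percentage points. The same 8,000 examples are often more useful in the training set.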