
# Text classification dataset examples

Let's have a look at some text classification datasets from the Hugging Face datasets repository (https://huggingface.co/datasets).

(You can find a tutorial for `datasets` here: https://huggingface.co/docs/datasets/tutorial)

First, install the `datasets` Python package:

In [92]:
!pip install --quiet datasets

Disable the progress bars to make loading a bit less verbose. (This only affects what is shown on screen while loading.)

In [93]:
from datasets import disable_progress_bar

disable_progress_bar()

We'll mainly use the `load_dataset` function to download data from the repository by name. To see what's available for this task, you can navigate to https://huggingface.co/datasets and select "text classification" from the filters. You can also use `load_dataset_builder` to get information on a dataset before downloading it, as the examples in this notebook do.

## Example: IMDB

You might already be familiar with this one. (Here, `pprint` is Python's standard pretty-printing module.)

In [94]:
from datasets import load_dataset, load_dataset_builder
from pprint import pprint

builder = load_dataset_builder('imdb')

print(builder.info.description)

Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.


In [95]:
imdb = load_dataset('imdb')



Printing the dataset object shows a summary of its contents:

In [96]:
print(imdb)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


So, here we have three parts:

* `train`: 25,000 examples
* `test`: 25,000 examples
* `unsupervised`: 50,000 examples

(The first two are conventional; the third is atypical and holds the additional unlabeled data mentioned in the description.)

Each example has a `text` and a `label`.

Let's look at an example:

In [97]:
pprint(imdb['train'][0]['text'])
print('\nLabel:', imdb['train'][0]['label'])

('I rented I AM CURIOUS-YELLOW from my video store because of all the '
 'controversy that surrounded it when it was first released in 1967. I also '
 'heard that at first it was seized by U.S. customs if it ever tried to enter '
 'this country, therefore being a fan of films considered "controversial" I '
 'really had to see this for myself.<br /><br />The plot is centered around a '
 'young Swedish drama student named Lena who wants to learn everything she can '
 'about life. In particular she wants to focus her attentions to making some '
 'sort of documentary on what the average Swede thought about certain '
 'political issues such as the Vietnam War and race issues in the United '
 'States. In between asking politicians and ordinary denizens of Stockholm '
 'about their opinions on politics, she has sex with her drama teacher, '
 'classmates, and married men.<br /><br />What kills me about I AM '
 'CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. '
 "Really, 

Here the `label` is just the value `0`. To interpret this, we can look at `features`:

In [98]:
print(imdb['train'].features['label'].names)

['neg', 'pos']


So, the label `0` corresponds to `neg` (negative) and the label `1` to `pos` (positive). 
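
When printing examples or interpreting predictions, it is often handy to have this mapping in both directions. The following is a minimal sketch in plain Python, assuming the `names` list printed above; the variable names `id2label` and `label2id` are illustrative and not part of `datasets`.

```python
# Label names copied from the output above; in the notebook this would be
# imdb['train'].features['label'].names
names = ['neg', 'pos']

# Integer id -> name, and name -> integer id
id2label = dict(enumerate(names))
label2id = {name: i for i, name in id2label.items()}

print(id2label[0])       # neg
print(label2id['pos'])   # 1
```

The `ClassLabel` feature itself also provides `int2str` and `str2int` methods for the same conversions.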

---

## Example: `emotion`

Following the same process as above:

In [99]:
builder = load_dataset_builder('emotion')

print(builder.info.description)



Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper.



In [100]:
emotion = load_dataset('emotion')



In [101]:
print(emotion)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})


So, three parts:

* `train`: 16,000 examples
* `validation`: 2,000 examples
* `test`: 2,000 examples

(This is a common way to partition a dataset that you should already be familiar with.)
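
To make the partition concrete, the relative split sizes can be worked out from the row counts. This is a small sketch using the numbers printed above; the dictionary is typed in by hand here so that the snippet runs on its own.

```python
# Row counts copied from the printed DatasetDict above
num_rows = {'train': 16000, 'validation': 2000, 'test': 2000}

total = sum(num_rows.values())
fractions = {split: n / total for split, n in num_rows.items()}

# Show each split as a percentage of all examples
for split, frac in fractions.items():
    print(f'{split}: {frac:.0%}')
```

This gives an 80/10/10 split. In the notebook, the same counts could be collected with `{split: len(ds) for split, ds in emotion.items()}` instead of typing them in.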

In [102]:
pprint(emotion['train'][0]['text'])
print('\nLabel:', emotion['train'][0]['label'])

'i didnt feel humiliated'

Label: 0


In [103]:
print(emotion['train'].features['label'].names)

['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
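
With these names, the label `0` of the example above decodes to `sadness`. A natural next step with a multi-class dataset is to check how the classes are distributed. The following is a rough sketch in plain Python: `label_distribution` is a hypothetical helper, and the short `labels` list is made-up illustration data, not taken from the dataset.

```python
from collections import Counter

# Label names copied from the output above
names = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

def label_distribution(labels, names):
    """Count how often each label name occurs in a sequence of integer labels."""
    counts = Counter(labels)
    # Report every class, including ones that never occur
    return {name: counts.get(i, 0) for i, name in enumerate(names)}

# Made-up labels for illustration only
labels = [0, 1, 1, 3, 0, 1]
print(label_distribution(labels, names))
```

In the notebook, passing `emotion['train']['label']` instead of the toy list should give the counts for the actual training data.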


---

## Example: SNLI

In [104]:
builder = load_dataset_builder('snli')

print(builder.info.description)

The SNLI corpus (version 1.0) is a collection of 570k human-written English
sentence pairs manually labeled for balanced classification with the labels
entailment, contradiction, and neutral, supporting the task of natural language
inference (NLI), also known as recognizing textual entailment (RTE).



In [105]:
snli = load_dataset('snli')



In [106]:
print(snli)

DatasetDict({
    test: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 550152
    })
    validation: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 10000
    })
})


Again, the conventional three parts:

* `train`: 550,152 examples
* `validation`: 10,000 examples
* `test`: 10,000 examples

However, this time, instead of just `text` and `label`, each example has a `premise`, a `hypothesis`, and a `label`:

In [107]:
pprint(snli['train'][0]['premise'])
pprint(snli['train'][0]['hypothesis'])
print('\nLabel:', snli['train'][0]['label'])

'A person on a horse jumps over a broken down airplane.'
'A person is training his horse for a competition.'

Label: 1


In [108]:
print(snli['train'].features['label'].names)

['entailment', 'neutral', 'contradiction']
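
So the label `1` of the pair above stands for `neutral`. A minimal sketch of turning a premise-hypothesis pair into a readable form, using the example printed above; `describe_pair` is a hypothetical helper, not a `datasets` function.

```python
# Label names copied from the output above
names = ['entailment', 'neutral', 'contradiction']

def describe_pair(example, names):
    """Format an NLI example (a dict with premise, hypothesis, label) as text."""
    label = example['label']
    # Guard against out-of-range labels (e.g. a negative placeholder for a
    # missing label), which would otherwise index from the end of the list
    name = names[label] if 0 <= label < len(names) else 'unlabeled'
    return (f"Premise:    {example['premise']}\n"
            f"Hypothesis: {example['hypothesis']}\n"
            f"Label:      {name}")

# First training example, copied from the output above
example = {
    'premise': 'A person on a horse jumps over a broken down airplane.',
    'hypothesis': 'A person is training his horse for a competition.',
    'label': 1,
}
print(describe_pair(example, names))
```

In the notebook, `describe_pair(snli['train'][0], names)` should produce the same text without copying the example by hand.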
