<a href="https://colab.research.google.com/github/TurkuNLP/intro-to-nlp/blob/master/intro_2023_exercise_4_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise task 4: creating a dataset from corpus data (solution)

This notebook shows an example solution for exercise 4.


---

# Setup

In [1]:
!pip install --quiet datasets

In [5]:
import random

import datasets

---

# Process data into dataset

Download the raw data from the given URLs to local storage. The `-nc` (no-clobber) flag skips the download if the file already exists.

In [3]:
!wget --quiet -nc http://dl.turkunlp.org/TKO_7095_2023/imdb-positives.txt
!wget --quiet -nc http://dl.turkunlp.org/TKO_7095_2023/imdb-negatives.txt

Read the files, which contain one text per line.

In [6]:
with open('imdb-positives.txt') as f:
  positive_texts = [line.rstrip('\n') for line in f]
with open('imdb-negatives.txt') as f:
  negative_texts = [line.rstrip('\n') for line in f]

Reformat as a list of dictionaries with the keys `text` and `label`.

In [9]:
positive_examples = [{ 'text': text, 'label': 'positive' } for text in positive_texts]
negative_examples = [{ 'text': text, 'label': 'negative' } for text in negative_texts]

Combine the positive and negative examples into one list.

In [10]:
all_examples = positive_examples + negative_examples

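**Important**! Shuffle the data so that it is not ordered with all positives first and all negatives after them. Otherwise the splits taken by slicing in the next step would have very different label distributions.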

In [11]:
random.shuffle(all_examples)
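
If the shuffle should give the same order on every run, one option is a seeded generator. The following is a minimal sketch on toy data, not part of the exercise solution: the seed value 42 and the `toy_examples` list are arbitrary illustrations.

```python
import random

# Toy stand-in for all_examples: all positives first, then all negatives
toy_examples = [{'text': f'pos {i}', 'label': 'positive'} for i in range(5)]
toy_examples += [{'text': f'neg {i}', 'label': 'negative'} for i in range(5)]

# A dedicated, seeded generator gives the same order on every run
# without touching the global random state (the seed 42 is arbitrary)
rng = random.Random(42)
shuffled = list(toy_examples)  # copy, so the original order is kept
rng.shuffle(shuffled)

print([e['label'] for e in shuffled])
```

Shuffling a copy keeps the original list available for comparison; `random.shuffle(all_examples)` as used above shuffles in place instead.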

Split into train, validation and test sets (80%, 10% and 10% of the examples).

In [12]:
total_size = len(all_examples)

train_size = int(0.8*total_size)
valid_size = int(0.1*total_size)
test_size = total_size - train_size - valid_size

train_examples = all_examples[:train_size]
valid_examples = all_examples[train_size:train_size+valid_size]
test_examples = all_examples[train_size+valid_size:]

assert all_examples == train_examples + valid_examples + test_examples
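
The slicing arithmetic can be checked on a small list whose size is not a multiple of ten. This is a sketch assuming the same 80/10/10 proportions; `split_examples` and the `toy` data are hypothetical and not part of the exercise solution.

```python
from collections import Counter

def split_examples(examples, train_frac=0.8, valid_frac=0.1):
    """Slice a list into train/validation/test parts (hypothetical helper)."""
    train_size = int(train_frac * len(examples))
    valid_size = int(valid_frac * len(examples))
    train = examples[:train_size]
    valid = examples[train_size:train_size + valid_size]
    test = examples[train_size + valid_size:]  # test gets whatever remains
    return train, valid, test

# 25 toy examples with alternating labels
toy = [{'text': str(i), 'label': 'positive' if i % 2 else 'negative'}
       for i in range(25)]
toy_train, toy_valid, toy_test = split_examples(toy)

print(len(toy_train), len(toy_valid), len(toy_test))

# Label counts per split, a quick check that no split is one-sided
for name, part in [('train', toy_train), ('validation', toy_valid), ('test', toy_test)]:
    print(name, Counter(e['label'] for e in part))
```

Because `int()` truncates, any remainder from rounding ends up in the test set: with 25 examples the sizes are 20, 2 and 3.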

Reformat as datasets. Notice that `Dataset.from_dict` expects a dictionary of lists rather than a list of dictionaries.

In [13]:
def make_dataset(examples):
  texts, labels = [], []
  for e in examples:
    texts.append(e['text'])
    labels.append(e['label'])
  data = {
      'text': texts,
      'label': labels,
  }
  return datasets.Dataset.from_dict(data)


train = make_dataset(train_examples)
valid = make_dataset(valid_examples)
test = make_dataset(test_examples)
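
The conversion from a list of dictionaries to a dictionary of lists can also be written generically over the keys. The following is a sketch in plain Python: `to_columns` is a hypothetical helper, and it assumes that every example has the same keys as the first one.

```python
def to_columns(examples):
    """Turn a list of dicts into a dict of lists (hypothetical helper).

    Assumes all examples share the keys of the first one.
    """
    if not examples:
        return {}
    return {key: [e[key] for e in examples] for key in examples[0]}

rows = [
    {'text': 'great film', 'label': 'positive'},
    {'text': 'dull plot', 'label': 'negative'},
]
columns = to_columns(rows)
print(columns)
```

The result has the column-oriented shape that `Dataset.from_dict` expects. Recent versions of `datasets` also provide `Dataset.from_list`, which accepts the list of dictionaries directly.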

Create a `DatasetDict` to wrap the `Dataset` objects, one per split.

In [14]:
data = {
    'train': train,
    'validation': valid,
    'test': test,
}

dataset = datasets.DatasetDict(data)

Check the final dataset.

In [15]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 40000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})