
# Exercise task 4: creating a dataset from corpus data (solution)

This notebook shows an example solution for exercise 4.


---

# Setup

In [47]:
!pip install --quiet datasets

In [48]:
import random

import datasets

---

# Process data into dataset

Download raw data from given URLs to local drive

In [49]:
!wget --quiet -nc http://dl.turkunlp.org/TKO_7095_2023/imdb-positives.txt
!wget --quiet -nc http://dl.turkunlp.org/TKO_7095_2023/imdb-negatives.txt

Read in the files, which have one text per line

In [50]:
# Use context managers so the files are closed after reading
with open('imdb-positives.txt') as f:
  positive_texts = f.readlines()
with open('imdb-negatives.txt') as f:
  negative_texts = f.readlines()

Reformat as list of dictionaries with `text` and `label`

In [51]:
positive_examples = [{ 'text': text, 'label': 'positive' } for text in positive_texts]
negative_examples = [{ 'text': text, 'label': 'negative' } for text in negative_texts]

Combine these

In [52]:
all_examples = positive_examples + negative_examples

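**Important**! Shuffle your data so that you don't have all positives first followed by all negatives. Without shuffling, the slicing in the next step would put only negatives into the validation and test sets.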

In [53]:
random.shuffle(all_examples)
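
`random.shuffle` produces a different order on every run, so the splits change between runs. If reproducibility matters, one option is to shuffle with a seeded generator. A minimal sketch on toy data; the seed value 42, the `toy_examples` list and the `seeded_shuffle` helper are illustrative assumptions, not part of the exercise:

```python
import random

# Toy stand-in for all_examples: positives first, then negatives
toy_examples = (
    [{'text': f'pos {i}', 'label': 'positive'} for i in range(3)]
    + [{'text': f'neg {i}', 'label': 'negative'} for i in range(3)]
)

def seeded_shuffle(examples, seed=42):
  """Return a shuffled copy; the same seed always gives the same order."""
  shuffled = list(examples)
  random.Random(seed).shuffle(shuffled)
  return shuffled

first = seeded_shuffle(toy_examples)
second = seeded_shuffle(toy_examples)
print(first == second)  # True: same seed, same order
```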

Split into train, validation and test sets

In [54]:
total_size = len(all_examples)

train_size = int(0.8*total_size)
valid_size = int(0.1*total_size)
test_size = total_size - train_size - valid_size

train_examples = all_examples[:train_size]
valid_examples = all_examples[train_size:train_size+valid_size]
test_examples = all_examples[train_size+valid_size:]
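
The same slicing can be wrapped in a small helper so that the fractions are easy to change. This is only a sketch: `split_examples` and its parameters are hypothetical names, and it assumes the examples have already been shuffled.

```python
def split_examples(examples, train_frac=0.8, valid_frac=0.1):
  """Slice an already-shuffled list into train, validation and test parts."""
  total = len(examples)
  train_end = int(train_frac * total)
  valid_end = train_end + int(valid_frac * total)
  # The test part takes whatever remains, so no example is dropped
  return examples[:train_end], examples[train_end:valid_end], examples[valid_end:]

toy = list(range(10))
train_part, valid_part, test_part = split_examples(toy)
print(len(train_part), len(valid_part), len(test_part))  # 8 1 1
```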

In [55]:
def check_data_integrity(train, valid, test):
  print("Examples in train, valid and test sets:")
  print(len(train), len(valid), len(test))
  print()
  print("Proportion of 'positive' in the datasets:")
  print("Train:", sum(example['label'] == "positive" for example in train) / len(train))
  print("Valid:", sum(example['label'] == "positive" for example in valid) / len(valid))
  print("Test:", sum(example['label'] == "positive" for example in test) / len(test))

In [56]:
check_data_integrity(train_examples, valid_examples, test_examples)

Examples in train, valid and test sets:
40000 5000 5000

Proportion of 'positive' in the datasets:
Train: 0.49945
Valid: 0.4974
Test: 0.507
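
The label proportions can also be computed with `collections.Counter`, which extends naturally to more than two classes. A sketch on toy data (`label_proportions` is a hypothetical helper name):

```python
from collections import Counter

def label_proportions(examples):
  """Map each label to its share of the examples."""
  counts = Counter(example['label'] for example in examples)
  return {label: count / len(examples) for label, count in counts.items()}

toy = [
    {'text': 'great', 'label': 'positive'},
    {'text': 'awful', 'label': 'negative'},
    {'text': 'fine', 'label': 'positive'},
    {'text': 'dull', 'label': 'negative'},
]
print(label_proportions(toy))  # {'positive': 0.5, 'negative': 0.5}
```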


In [57]:
# Alternative: stratified splitting with scikit-learn's train_test_split
from sklearn.model_selection import train_test_split

all_labels = [example['label'] for example in all_examples]

train_examples, test_examples = train_test_split(all_examples,
                                                 train_size=0.8,
                                                 random_state=42,
                                                 stratify=all_labels)

# Split the held-out 20% in half to get the test and validation sets
test_labels = [example['label'] for example in test_examples]

test_examples, valid_examples = train_test_split(test_examples,
                                                 train_size=0.5,
                                                 random_state=42,
                                                 stratify=test_labels)

In [58]:
check_data_integrity(train_examples, valid_examples, test_examples)

Examples in train, valid and test sets:
40000 5000 5000

Proportion of 'positive' in the datasets:
Train: 0.5
Valid: 0.5
Test: 0.5
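
Stratification is why every proportion above is exactly 0.5: each class is split with the same fraction, so the parts keep the class balance of the full data. The idea can be sketched in plain Python. This illustrates the principle only, not how scikit-learn implements it, and `stratified_split` is a hypothetical helper:

```python
import random
from collections import defaultdict

def stratified_split(examples, train_frac=0.8, seed=42):
  """Split each label group separately so both parts keep the class balance."""
  rng = random.Random(seed)
  by_label = defaultdict(list)
  for example in examples:
    by_label[example['label']].append(example)

  first, second = [], []
  for group in by_label.values():
    rng.shuffle(group)
    cut = int(train_frac * len(group))
    first.extend(group[:cut])
    second.extend(group[cut:])
  # Shuffle again so the labels are not grouped together within each part
  rng.shuffle(first)
  rng.shuffle(second)
  return first, second

toy = [{'text': str(i), 'label': 'positive' if i % 2 else 'negative'} for i in range(20)]
train_part, rest_part = stratified_split(toy)
print(len(train_part), len(rest_part))  # 16 4
```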


Reformat as datasets. Notice that `Dataset.from_dict` expects a dictionary of lists rather than a list of dictionaries.

In [59]:
def make_dataset(examples):
  texts, labels = [], []
  for e in examples:
    texts.append(e['text'])
    labels.append(e['label'])
  data = {
      'text': texts,
      'label': labels
  }
  return datasets.Dataset.from_dict(data)


train = make_dataset(train_examples)
valid = make_dataset(valid_examples)
test = make_dataset(test_examples)
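
The conversion from a list of dictionaries to a dictionary of lists does not depend on the column names. A more generic sketch in plain Python, assuming every example has the same keys (`to_columns` is a hypothetical helper):

```python
def to_columns(examples):
  """Turn [{'text': ..., 'label': ...}, ...] into {'text': [...], 'label': [...]}."""
  if not examples:
    return {}
  return {key: [example[key] for example in examples] for key in examples[0]}

rows = [
    {'text': 'great film', 'label': 'positive'},
    {'text': 'dull film', 'label': 'negative'},
]
print(to_columns(rows))
# {'text': ['great film', 'dull film'], 'label': ['positive', 'negative']}
```

The result has the same shape as the `data` dictionary that `make_dataset` passes to `Dataset.from_dict`.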

Create `DatasetDict` to wrap the `Dataset` objects

In [60]:
data = {
    'train': train,
    'validation': valid,
    'test': test,
}

dataset = datasets.DatasetDict(data)

Check final dataset

In [61]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 40000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})

## Alternative approach using Datasets early

In [64]:
# Use context managers so the files are closed after reading
with open('imdb-positives.txt') as f:
  positive_texts = f.readlines()
with open('imdb-negatives.txt') as f:
  negative_texts = f.readlines()

positive_examples = [{ 'text': text, 'label': 'positive' } for text in positive_texts]
negative_examples = [{ 'text': text, 'label': 'negative' } for text in negative_texts]

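all_examples = positive_examples + negative_examples

# The output below is a Dataset, so the cell presumably ended by building one,
# for example with make_dataset from above
make_dataset(all_examples)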

Dataset({
    features: ['text', 'label'],
    num_rows: 50000
})

In [71]:
# To use the 'stratify_by_column' argument, the label column must be a ClassLabel,
# so we assign the features explicitly

def make_dataset_modified(examples):
  texts, labels = [], []
  for e in examples:
    texts.append(e['text'])
    labels.append(e['label'])
  data = {
      'text': texts,
      'label': labels
  }
  features = datasets.Features({
      'text': datasets.Value('string'),
      'label': datasets.ClassLabel(names=['negative', 'positive'])
      })
  return datasets.Dataset.from_dict(data, features=features)

In [72]:
ds = make_dataset_modified(all_examples)

In [80]:
# Now the dataset has features
ds.features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['negative', 'positive'], id=None)}

In [81]:
# And we can access the label names
ds.features['label'].names

['negative', 'positive']

In [85]:
# NOTE! Now the class names have been mapped to 0 and 1
ds[0]

{'text': 'Zentropa has much in common with The Third Man, another noir-like film set among the rubble of postwar Europe. Like TTM, there is much inventive camera work. There is an innocent American who gets emotionally involved with a woman he doesn\'t really understand, and whose naivety is all the more striking in contrast with the natives.<br /><br />But I\'d have to say that The Third Man has a more well-crafted storyline. Zentropa is a bit disjointed in this respect. Perhaps this is intentional: it is presented as a dream/nightmare, and making it too coherent would spoil the effect. <br /><br />This movie is unrelentingly grim--"noir" in more than one sense; one never sees the sun shine. Grim, but intriguing, and frightening.\n',
 'label': 1}
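
The integer is the position of the class name in the `names` list given to `ClassLabel`, so `'negative'` becomes 0 and `'positive'` becomes 1. A plain-Python sketch of that mapping, for illustration only (it does not use the `datasets` library, and the variable names are hypothetical):

```python
names = ['negative', 'positive']

# Name -> integer id, following the order of the names list
name_to_id = {name: index for index, name in enumerate(names)}

# Integer id -> name is just list indexing
print(name_to_id['positive'])  # 1
print(names[1])                # positive
```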

In [76]:
# Now, we can split into train and test with stratification
# Unlike sklearn's train_test_split, this method returns a DatasetDict
ds_train_test = ds.train_test_split(train_size=0.8, stratify_by_column="label")
ds_train_test

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 40000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 10000
    })
})

In [86]:
# Split the held-out test set in half; the 'train' half of this split is used as the validation set
ds_valid_test = ds_train_test['test'].train_test_split(test_size=0.5, stratify_by_column="label")
ds_valid_test

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})

In [87]:
data = datasets.DatasetDict({
    'train': ds_train_test['train'],
    'valid': ds_valid_test['train'],
    'test': ds_valid_test['test']
})
data

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 40000
    })
    valid: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})