<a href="https://colab.research.google.com/github/TurkuNLP/Turku-hockey-data2text/blob/main/turku_hockey_data2text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Turku Hockey Data2Text

The Turku Hockey Data2Text corpus is a **manually curated corpus for Finnish news generation in the area of ice hockey reporting**. It was developed as a benchmark for evaluating template-free machine learning methods on Finnish ice hockey news generation.

The dataset includes 3,454 ice hockey games, where the game statistics are aligned with a corresponding news article describing the game outcome. Each game is composed of a list of events, e.g. goals or penalties, extracted from the game statistics. During the manual annotation, **each event was aligned with a sentence-like passage reporting the event in the news article**; if no suitable passage was found, the annotation was left empty. Furthermore, the extracted **passages were manually modified so that they do not include information that is neither derivable from the game statistics nor considered world knowledge**. The manual curation of passages is designed to prevent model hallucination, i.e. a model learning to generate facts not derivable from the input data.

Thus, the dataset can be used to train models for generating natural language descriptions for ice hockey game events.

Example (in simplified format):

```
TPS–HPK 0–1 (0–1, 0–0, 0–0) ||| HPK kukisti TPS:n vieraissa 1–0 (1–0, 0–0, 0–0).

HPK Mikko Mäenpää 0–1 power play 14.57 ||| HPK hyödynsi ylivoimaa mennen jo ensimmäisessä erässä Mikko Mäenpään maalilla 1–0 -johtoon.
```


In [4]:
!pip install datasets


# Loading the dataset



In [5]:
from datasets import load_dataset, Dataset

# load data from Huggingface datasets
dataset = load_dataset("TurkuNLP/turku_hockey_data2text")

# print one example
print(dataset["train"][15])
print(dataset)


{'id': 'stt-topics:381124-TPS-HIFK', 'news_article': 'HIFK:n Robert Leinon avausmaali runsaan minuutin pelin jälkeen pohjusti helsinkiläisten niukan 2–1-voiton Turun Palloseurasta lauantaina Turkuhallissa. Näin TPS kärsi kolmannen perättäisen tappion jääkiekkoliigassa.\nHIFK:lla oli hyvä mahdollisuus kaunistaa maalilukujaan toisen erän puolivälissä, kun joukkue pääsi pelaamaan lähes kaksi minuuttia kahden miehen ylivoimalla. Oskari Siikin kaksiminuuttisen lisäksi TPS:n Henrik Tallinder sai pelirangaistuksen korkeasta mailasta, mutta maalinteko ei kuitenkaan helsinkiläisiltä ylivoimalla onnistunut.\n– Hieno suoritus joukkueelta raskaan peliviikon jälkeen, ja pystyimme onneksi pitämään johtoaseman loppuun asti, iloitsi HIFK:n päävalmentaja Antti Törmänen.\nTPS:n päävalmentaja Ari-Pekka Selin harmitteli takaa-ajon epäonnistumista kolmannessa erässä.\n– Emme pelanneet huonosti, mutta ratkaisuja ei vaan saatu aikaan, Selin harmitteli.\n', 'events': {'event_id': ['E1', 'E2', 'E3', 'E4', 'E5'
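
The printed example shows that `events` is stored column-wise: each key, such as `event_id`, maps to a list with one entry per event. The following is a minimal sketch of turning that layout into one dictionary per event, which can be easier to inspect. The toy `events` dictionary is invented for illustration and contains only a few of the real keys, and `events_to_rows` is a hypothetical helper, not part of the dataset or the `datasets` library.

```python
def events_to_rows(events):
    """Convert a dict of equal-length lists into a list of per-event dicts."""
    keys = list(events.keys())
    n_events = len(events["event_id"])
    return [{key: events[key][i] for key in keys} for i in range(n_events)]


# Toy stand-in for game["events"]; the values are invented for illustration.
toy_events = {
    "event_id": ["E1", "E2"],
    "event_type": ["game result", "goal"],
    "text": ["HPK kukisti TPS:n vieraissa 1–0 (1–0, 0–0, 0–0).", ""],
    "multi_reference": [False, False],
}

rows = events_to_rows(toy_events)
print(rows[1]["event_type"])
```

With real data, the same call would be `events_to_rows(game["events"])` for any `game` in a split.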

# Prepare the data for generation

* An example of how to prepare the dataset for simple **event-level sequence-to-sequence generation**, where
  * input: one event (string)
  * output: description of the event
* Limitations of the **simplified representation**:
  * game-level features are not included (they could be used to represent relations between events)
  * multi-reference events are discarded (multiple events aligned with the same text passage)

In [6]:
# prepare simplified data for data2text generation
# one event in, description out

import re

# keys used in the input representation for each event type
# ("text" and "event_id" are skipped because they are not relevant for the input)
relevant_keys = {
  "game result": ["event_type", "home_team", "guest_team", "score", "periods", "features"],
  "goal": ["event_type", "score", "features", "player", "assist", "team", "team_name", "time"],
  "penalty": ["event_type", "player", "team", "team_name", "time", "penalty_minutes"],
  "saves": ["event_type", "player", "team", "team_name", "saves"],
}

def event2string(i, events):
  """Featurize the i-th event into a string input.

     Returns an (input, output) pair, or (None, None) if the event is skipped.
     Example:
        input: "event_type: saves [SEP] player: Jani Hurme [SEP] team: home [SEP] team_name: TPS [SEP] saves: 25"
        output: "TPS:n maalissa Jani Hurme ehti 25 kiekon tielle."
  """
  if events["text"][i] == "":  # skip, event is not annotated
    return None, None
  if events["multi_reference"][i]:  # skip multi-reference events in the simple representation
    return None, None
  event_type = events["event_type"][i]
  fields = []
  for key in relevant_keys[event_type]:  # use only the relevant features for this event type
    value = events[key][i]
    if not isinstance(value, str):  # list-valued features (e.g. periods) are joined into one string
      value = " , ".join(value)
    fields.append(f"{key}: {value}")
  return " [SEP] ".join(fields), events["text"][i]


simplified_data = {}

for data_split in dataset.keys():
  simplified_data[data_split] = []
  # iterate over games
  for game in dataset[data_split]:
    events = game["events"]
    # iterate over events in a game
    for i, event_id in enumerate(events["event_id"]):
      event_input, event_output = event2string(i, events)
      if not event_input: # skip empty annotations and multi-reference events
        continue 
      simplified_data[data_split].append({"input": event_input, "output": event_output})


for data_split in simplified_data.keys():
  print(f"simplified dataset {data_split}:", len(simplified_data[data_split]))
  print(f"First example in {data_split}:", simplified_data[data_split][0])


simplified dataset train: 6159
First example in train: {'input': 'event_type: game result [SEP] home_team: TPS [SEP] guest_team: HPK [SEP] score: 0–2 [SEP] periods: 0–2 , 0–0 , 0–0 [SEP] features: ', 'output': 'HPK kukisti TPS:n vieraissa 2–0 (2–0, 0–0, 0–0).'}
simplified dataset validation: 755
First example in validation: {'input': 'event_type: game result [SEP] home_team: HPK [SEP] guest_team: Ilves [SEP] score: 3–1 [SEP] periods: 0–0 , 2–0 , 1–1 [SEP] features: ', 'output': 'Kotikaukalossaan pelannut HPK vei 3–1 (0–0, 2–0, 1–1) -voiton Ilveksestä.'}
simplified dataset test: 706
First example in test: {'input': 'event_type: game result [SEP] home_team: Ässät [SEP] guest_team: Lukko [SEP] score: 3–2 [SEP] periods: 1–0 , 1–1 , 0–1 , 1–0 [SEP] features: ', 'output': 'Ässät otti pisteet Lukolta lukemin 3–2 (1–0, 1–1, 0–1, 1–0).'}
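
When inspecting model inputs or generated text, it can be handy to turn a simplified input string back into key-value pairs. The following is a minimal sketch, assuming that no feature value itself contains the ` [SEP] ` delimiter; `parse_event_input` is a hypothetical helper introduced here, and the example string is copied from the first training example printed above.

```python
def parse_event_input(event_input):
    """Split a '[SEP]'-delimited input string back into a key -> value dict."""
    fields = {}
    for field in event_input.split(" [SEP] "):
        # split on the first colon only, so values may contain colons
        key, _, value = field.partition(":")
        fields[key.strip()] = value.strip()
    return fields


example = ("event_type: game result [SEP] home_team: TPS [SEP] guest_team: HPK "
           "[SEP] score: 0–2 [SEP] periods: 0–2 , 0–0 , 0–0 [SEP] features: ")
parsed = parse_event_input(example)
print(parsed["score"])
```

An empty feature, such as `features` in this example, is returned as an empty string.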


# Data statistics

* Total number of games: 3,454
* Games with at least one annotated event: 2,248
* Total number of events: 58,490
* Annotations:
  * Aligned events: 12,827
  * Aligned text passages: 9,272

* **Single- and multi-reference alignments**
  * single-reference means that a text passage is aligned with exactly one event, while multi-reference means that a text passage is aligned with several events
  * Example of a single-reference text passage: `In the second period Steve Moses scored a 2–1 lead.` (one goal)
  * Example of a multi-reference text passage: `The home team received two penalties towards the end of the first period.` (two penalties)
  * **Single-reference annotations**
    * events: 7,620
    * text passages: 7,620
  * **Multi-reference annotations**
    * events: 5,207
    * text passages: 1,652


* **Event types** (calculated from all annotated events):
  * goal: 6,981
  * penalty: 2,226
  * game result: 2,203
  * saves: 1,417

In [23]:
from collections import Counter


games = 0
nonempty_games = 0
all_events = 0
annotated_events = 0
single_events = 0
multi_events = 0
annotated_event_types = Counter()
for dsplit in dataset.keys():
  for game in dataset[dsplit]:
    games += 1
    events = game["events"]
    if any(text != "" for text in events["text"]):  # at least one annotated event
      nonempty_games += 1
    for i, event_id in enumerate(events["event_id"]):
      all_events += 1
      if events["text"][i] == "":  # not annotated
        continue
      annotated_events += 1
      if events["multi_reference"][i]:
        multi_events += 1
      else:
        single_events += 1
      annotated_event_types[events["event_type"][i]] += 1
print(f"Number of games: {games} (games with at least one annotated event: {nonempty_games})")
print("Total number of events:", all_events)
print("Annotated events:", annotated_events)
print("Single-reference events:", single_events)
print("Multi-reference events:", multi_events)
print("Event types (calculated from all annotated events):", annotated_event_types.most_common(10))



Number of games: 3454 (games with at least one annotated event: 2248)
Total number of events: 58490
Annotated events: 12827
Single-reference events: 7620
Multi-reference events: 5207
Event types (calculated from all annotated events): [('goal', 6981), ('penalty', 2226), ('game result', 2203), ('saves', 1417)]
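
The cell above counts events but not text passages. The following is a rough sketch of how passages could be counted per game. It rests on an assumption that is not confirmed here: multi-reference events aligned with the same passage carry an identical `text` string within a game. `count_passages` is a hypothetical helper, and the toy data reuses the two example passages from the statistics section.

```python
def count_passages(events):
    """Count annotated passages in one game as (single-reference, multi-reference)."""
    single_passages = 0
    multi_passages = set()
    for text, is_multi in zip(events["text"], events["multi_reference"]):
        if text == "":  # not annotated
            continue
        if is_multi:
            # assumption: events sharing a passage have identical text
            multi_passages.add(text)
        else:
            single_passages += 1
    return single_passages, len(multi_passages)


# Toy stand-in for game["events"]: one goal, two penalties sharing a passage,
# and one event without an annotation.
toy_events = {
    "event_id": ["E1", "E2", "E3", "E4"],
    "text": [
        "In the second period Steve Moses scored a 2–1 lead.",
        "The home team received two penalties towards the end of the first period.",
        "The home team received two penalties towards the end of the first period.",
        "",
    ],
    "multi_reference": [False, True, True, False],
}

single_count, multi_count = count_passages(toy_events)
print(single_count, multi_count)
```

Summing these per-game counts over all splits would give corpus-level passage counts, provided the assumption about identical passage text holds.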
