# LIA subsets

In [UD](https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA/tree/8a4ea1a6e0e1fbb4ef5ba34c2d408563e9c8cf9a), LIA is divided into train, dev and test partitions.

The [LIA](https://github.com/textlab/spoken_norwegian_resources/tree/master/treebanks/Norwegian-NynorskLIA) conllu treebank is divided into 18 files, one per speaker/conversation.

Each sentence in UD has a `sent_id` that is unique across all partitions. In LIA, by contrast, each sentence has a file-internal `id` that starts at 1 and increments within each file.

Here we map each UD sentence (speaker/file id + `sent_id`) back to LIA to recreate the partitions.
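
As a rough illustration of the idea, the sketch below uses invented toy data rather than the real treebank. UD sentences carry a global `sent_id` and a speaker id, while LIA sentences only carry a file-internal `id`, so the link has to go through the file name and the sentence text. The names `find_lia_sentence`, `file_a` and `file_b` are hypothetical.

```python
from typing import Optional

# Toy data only: the ids, file names and texts are invented for illustration.
ud_sentences = [
    {"sent_id": "000101", "speakerid": "file_a", "text": "ja ."},
    {"sent_id": "000102", "speakerid": "file_b", "text": "nei ."},
]
# LIA: one list of sentences per file; ids restart from 1 in every file.
lia_files = {
    "file_a": [{"id": "1", "text": "ja"}],
    "file_b": [{"id": "1", "text": "nei"}],
}


def find_lia_sentence(ud_sentence: dict, lia_files: dict) -> Optional[dict]:
    """Return the first LIA sentence in the speaker's file with the same text."""
    ud_text = ud_sentence["text"]
    for lia_sentence in lia_files.get(ud_sentence["speakerid"], []):
        # Accept an exact match or a match without the trailing " ."
        if lia_sentence["text"] in (ud_text, ud_text.rstrip(" .")):
            return lia_sentence
    return None


# Global UD sent_id -> file-internal LIA sentence
recovered = {s["sent_id"]: find_lia_sentence(s, lia_files) for s in ud_sentences}
```

Both toy files contain a sentence with `id` 1, which is why the file-internal `id` alone cannot identify a sentence across the treebank.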

# Load data

In [None]:
from pathlib import Path

import conllu
from conllu import parse

UD_path = Path("../data/UD_Norwegian-NynorskLIA")
LIA_path = Path("../spoken_norwegian_resources/treebanks/Norwegian-NynorskLIA")
LIA_old_path = Path("../spoken_norwegian_resources/treebanks/Norwegian-NynorskLIA_old")


def load_partition(filepath: Path, partition: str = "train") -> list:
    """Load one of the UD dataset partitions train, dev, or test."""
    data = next(filepath.glob(f"*{partition}.conllu")).read_text()
    sentences = parse(
        data,
        metadata_parsers={
            "sent_id": lambda key, value: (key, value),
            "text": lambda key, value: (key, value),
            "__fallback__": lambda key, value: [
                [k.rstrip(":"), key.split()[i + 1]]
                for i, k in list(enumerate(key.split()))[::2]
            ],
        },
    )
    return sentences


def load_lia_sentences(
    filestem: str,
    dir_path: Path = Path("../spoken_norwegian_resources/treebanks/"),
    use_old: bool = False,
) -> list:
    """Load the LIA sentences for one speaker/conversation file.

    Reads the old LIA version if `use_old` is True; otherwise reads the
    new version and falls back to the old one if the file is missing.
    """
    LIA_path = dir_path / "Norwegian-NynorskLIA"
    LIA_old_path = dir_path / "Norwegian-NynorskLIA_old"
    filename = filestem + ".conll"

    candidates = [LIA_old_path] if use_old else [LIA_path, LIA_old_path]
    lia_data = ""
    for directory in candidates:
        try:
            lia_data = (directory / filename).read_text()
            break
        except FileNotFoundError:
            continue
    else:
        # No candidate directory contained the file
        print(f"Couldn't load {filestem}")
    return parse(lia_data)
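
The `__fallback__` parser in `load_partition` is compact, so here is a small sketch of what its pairing expression does on its own. The example string is hypothetical and the real UD comment lines may use different keys. The helper name `pair_up` is introduced only for this illustration.

```python
def pair_up(line: str) -> list:
    """Apply the same pairing expression as the `__fallback__` parser."""
    tokens = line.split()
    # Take every second token as a key and the token right after it as its value
    return [[k.rstrip(":"), tokens[i + 1]] for i, k in list(enumerate(tokens))[::2]]


# Hypothetical comment content, not copied from the treebank
example = "speakerid: file_a gender: f"
pairs = pair_up(example)
print(pairs)  # [['speakerid', 'file_a'], ['gender', 'f']]
```

Note that the expression expects an even number of whitespace-separated tokens. A comment line with an odd number of tokens would raise an `IndexError`.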

In [None]:
partition = "dev"
(UD_sentences := load_partition(UD_path, partition))

# Mapping
Iterate over the UD sentences and map each one to the corresponding LIA sentence.

In [None]:
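def map_ud_partition_to_lia_sentences(sentences: list[conllu.models.TokenList]) -> dict:
    """Match each UD sentence to the LIA sentence with the same text in the speaker's file.

    If several LIA sentences in the file share the text, the last one wins.
    """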
    mapping = {}
    no_match = {}

    for sentence in sentences:
        sent_id = sentence.metadata["sent_id"]
        UD_text = sentence.metadata["text"]
        filestem = sentence.metadata["speakerid"]
        lia_sentences = load_lia_sentences(filestem)
        for sent in lia_sentences:
            LIA_text = sent.metadata["text"]
            if (LIA_text == UD_text) or (LIA_text == UD_text.rstrip(" .")):
                mapping[sent_id] = sent

        if sent_id not in mapping:
            no_match[sent_id] = sentence

    return {"match": mapping, "no_match": no_match}

In [None]:
mapping = map_ud_partition_to_lia_sentences(UD_sentences)

# Annotate the partition with correct sent_ids and save

In [None]:
lia_partition = []
for sent_id, sentence in mapping["match"].items():
    sentence.metadata["sent_id"] = sent_id
    lia_partition.append(sentence)

In [None]:
# Save to disk
def save_conllu(data, filepath: Path | str):
    with open(filepath, "w") as f:
        f.writelines([sentence.serialize() for sentence in data])

In [None]:
new_LIA_UD_file = Path(f"../data/lia_{partition}.conllu")
save_conllu(lia_partition, new_LIA_UD_file)

## Handle mismatches

In [None]:
no_match = mapping["no_match"]
print(f"No LIA sentence was found for {len(no_match)} UD {partition} sent_ids.")

no_match_fname = f"no_match_{partition}.txt"

with open(no_match_fname, "w") as f:
    # One sent_id per line
    f.write("\n".join(no_match.keys()))

print(f'They have been saved to "{no_match_fname}" for later processing')

## Find missing sentences

There is a mismatch between the ids in the two versions of LIA:

- The old version of the Vardø treebank has 9848 lines (spoken_norwegian_resources/treebanks/Norwegian-NynorskLIA_old/vardoe_uio_01.conll)
- The new version has only 8838 lines

---

Sometimes the wrong sentence has been saved to the new LIA partition because the same text occurs multiple times (often "ja", "jaha", "...", etc.). These cases are easy to spot: the incremental id numbers suddenly jump before returning to the expected series, e.g. "id = 183", "id = 660", then "id = 185".

In these cases, just delete the wrong sentences from the `lia_{partition}.conllu` file and find them again with the troubleshooting cells here.
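
A minimal sketch of how such jumps could be flagged automatically, assuming the ids of consecutive sentences from the same LIA file are expected to increase along the partition. The function name `find_id_jumps` is hypothetical. The sample ids reuse the example above plus one invented value.

```python
def find_id_jumps(lia_ids: list[int]) -> list[int]:
    """Return positions whose id does not lie between its two neighbours' ids.

    `lia_ids` holds the file-internal ids of consecutive sentences from
    one LIA file, in partition order.
    """
    suspicious = []
    for pos in range(1, len(lia_ids) - 1):
        previous, current, following = lia_ids[pos - 1 : pos + 2]
        # A correct match sits between its neighbours; a wrong duplicate
        # match jumps outside that range.
        if previous < following and not (previous < current < following):
            suspicious.append(pos)
    return suspicious


ids = [183, 660, 185, 186]
print(find_id_jumps(ids))  # [1]
```

The check only covers interior positions and does not handle sequences that span several files, so it is a starting point for manual inspection rather than a complete validation.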

---

**NB**: Sentences that consist only of punctuation are skipped: `# text = - … .`

This is the case for 3 sentences in `austevoll_uib_01.conll`:
- 003752
- 003773
- 003778

Two other sent_ids are special cases:
- 005026: a merge of `flakstad_uib_04` id `235` ("ja") and `236` ("...")
- 001476: no longer exists in the new version of LIA, but had the id 282 in the old version. It would have been between ids 271 and 272 in the new LIA.

> TODO: Note in the README that these sentences are skipped when we publish LIA again

--- 

There is one instance of two sentences being merged in the old UD LIA: 

UD `sent_id = 004443` corresponds to LIA `austevoll_uib_04.conll` id `80` and `81`. 

Instead of re-indexing all of UD and giving new sentence IDs to all following LIA segments, we remove the preceding segment (a single word, "nei", of which there are more examples elsewhere in the treebank) and reuse its sent_id `004442`.

---

Number of segments in each partition after I've found the missing ones: 
- LIA dev: 878
- LIA test: 956
- LIA train: 3411
- **Total: 5245**

In comparison, the old UD LIA treebank contains 5250 segments.

In [None]:
def count_segments(partition: str) -> int:
    new_LIA_UD_file = Path(f"../data/lia_{partition}.conllu")
    lia_partition = parse(new_LIA_UD_file.read_text())
    indexed_LIA = {s.metadata["sent_id"]: s for s in lia_partition}
    print(f"LIA {partition}: {len(indexed_LIA)}")
    return len(indexed_LIA)


segments = 0
for partition in ("dev", "test", "train"):
    segments += count_segments(partition)

print(f"Total: {segments}")

---

In [None]:
# Load the old LIA UD partition and the new LIA UD partition
partition = "train"
ud_partition = load_partition(UD_path, partition=partition)
new_LIA_UD_file = Path(f"../data/lia_{partition}.conllu")
lia_partition = parse(new_LIA_UD_file.read_text())

### Index the treebanks with sent_ids

In [None]:
# Index and filter the two treebanks to get ids for the missing sentences
indexed_UD = {s.metadata["sent_id"]: s for s in ud_partition}
print(f"length for UD {partition}: {len(indexed_UD)}")

indexed_LIA = {s.metadata["sent_id"]: s for s in lia_partition}
partition_order = [int(sent_id) for sent_id in indexed_LIA.keys()]
print(f"length for LIA {partition}: {len(lia_partition)}")

missing_sents = set(indexed_UD).difference(indexed_LIA)
missing = {
    sent_id: indexed_UD[sent_id]
    for sent_id in missing_sents
    if indexed_UD[sent_id].metadata["text"] != "- … ."
}  # Filter out the segments without any syntactic or semantic information
missing_sents = sorted(missing, key=int)

found = {}

print(f"{len(missing)} sentences in the UD {partition} were not found in LIA: ")
for s in missing.values():
    print(s.metadata)

### Fetch missing UD sentence

In [None]:
# Fetch one of the sentences from the old UD
from pprint import pprint

sent_id = missing_sents[0]
sent_number = int(sent_id)
UD_sentence = missing[sent_id]
pprint(UD_sentence.metadata)


### Find the corresponding sentence in LIA

In [None]:
filestem = UD_sentence.metadata["speakerid"]
# Uncomment to map filestems that were renamed between the old and the new LIA:
# filestem_map = {"lierne_uio_01": "nordli_uio_01"}
# filestem = filestem_map.get(filestem, filestem)

# CHANGE `use_old` if the resulting sentence doesn't match the previous cell's output
LIA_sentences = load_lia_sentences(filestem, use_old=False)

before = str(sent_number - 1).zfill(6)
after = str(sent_number + 1).zfill(6)

context_id = before  # CHANGE to `before` or `after`
context_sentence = indexed_LIA[context_id]
lia_id = context_sentence.metadata["id"]

# CHANGE: add to / subtract from the index to get the right sentence
LIA_sent = LIA_sentences[int(lia_id)]
print(
    (
        f"Sentence before/after the missing one ({sent_id}), "
        f"with lia id {lia_id} and UD sent_id {context_sentence.metadata['sent_id']}: "
    )
)
print(context_sentence.metadata)
print(f"Sentence from {filestem} with the corresponding index number: ")
print(LIA_sent.metadata)
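
List positions start at 0 while the LIA ids start at 1, so `LIA_sentences[int(lia_id)]` only returns the sentence following id `lia_id` when no ids are missing or duplicated in the file. A possible alternative, sketched here with stand-in objects rather than real `conllu` token lists, is to look sentences up by their `id`. The helper name `index_by_lia_id` is an assumption and not part of the notebook.

```python
from types import SimpleNamespace


def index_by_lia_id(sentences) -> dict:
    """Map each file-internal id (as int) to its sentence; the first occurrence wins."""
    index = {}
    for sentence in sentences:
        index.setdefault(int(sentence.metadata["id"]), sentence)
    return index


# Stand-ins for parsed sentences: only the `.metadata` mapping is mimicked.
toy_sentences = [
    SimpleNamespace(metadata={"id": "1", "text": "ja"}),
    SimpleNamespace(metadata={"id": "2", "text": "nei"}),
    SimpleNamespace(metadata={"id": "4", "text": "jaha"}),  # id 3 is absent
]
by_id = index_by_lia_id(toy_sentences)
next_sentence = by_id.get(2 + 1)  # sentence following id 2, or None if absent
print(next_sentence)  # None
```

With a lookup like this, a gap in the ids shows up as `None` instead of silently returning a neighbouring sentence.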

### Check for duplicate ids in the LIA files

In [None]:
from collections import Counter

c = Counter([sent.metadata["id"] for sent in LIA_sentences])
c
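
The full counter can be long, so it may help to keep only the ids that occur more than once. The snippet below is a small sketch using toy ids. In the notebook the ids would come from `sent.metadata["id"]`.

```python
from collections import Counter

# Toy ids for illustration only
toy_ids = ["1", "2", "2", "3", "3", "3"]
counts = Counter(toy_ids)

# Keep only the ids that appear more than once
duplicates = {lia_id: n for lia_id, n in counts.items() if n > 1}
print(duplicates)  # {'2': 2, '3': 3}
```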

## Add found sentence to the LIA partition and save to file

In [None]:
# add the correct sentence id and save it to the "found" dict
LIA_sent.metadata["sent_id"] = sent_id
found[sent_id] = LIA_sent

# insert the missing sentence at the right place
if sent_number not in partition_order:
    partition_order.append(sent_number)
partition_order = sorted(partition_order)
idx = partition_order.index(sent_number)
# Insert unless the sentence is already there; an index past the end appends
if idx >= len(lia_partition) or lia_partition[idx].metadata["sent_id"] != sent_id:
    lia_partition.insert(idx, found[sent_id])
print(lia_partition[idx].metadata)

# Remove from "missing"
for found_id in found:
    if found_id in missing:
        del missing[found_id]
        missing_sents.remove(found_id)

print(f"Still missing {len(missing)}: {list(missing)}")

# Save to file
print(f"LIA {partition} length: {len(lia_partition)}")
save_conllu(lia_partition, new_LIA_UD_file)