# VLSP Dataset Preparation

This notebook is a walkthrough of our preprocessing step for the VLSP2020 Relation Extraction dataset (view more [here](https://vlsp.org.vn/vlsp2020/eval/re)). Preprocessing this dataset is quite a challenge due to many points that need fixing in the annotation. Additionally, to meet the best setting for the experiment, we propose two different input formats. At the end of this step, a folder named `vlsp_preprocessed` is created with two subfolders for two input formats, each containing three datasets (train, dev, and test) as json files.

## Preparing Environment

PhoBERT requires Vietnamese sentences to be word-segmented by VnCoreNLP in advanced. Therefore, we need to set up the model and load it for later use.

In [1]:
import py_vncorenlp
import os

vncorenlp_dir = "D:/Projects/albert-imdb/vncorenlp"
project_dir = "D:/Projects/albert-imdb"

# uncomment this line if VnCoreNLP has not been downloaded
# py_vncorenlp.download_model(save_dir=vncorenlp_path)

# load VnCoreNLP
vncorenlp_model = py_vncorenlp.VnCoreNLP(save_dir=vncorenlp_dir, annotators=["wseg"])

# change directory back to the project
os.chdir(project_dir)

## Preprocessor Implementation

The whole process includes two main stages:

1. **Loading stage**: The preprocessor loads all tsv files in a directory and processes each at a time as a dataframe. For convenience, we rename the columns as `ann_idx`, `range`, `word`, `var`, `entity`, `relation`, and `rel_heads`. Next, we extract sentences denoting roughly by a word ending with `.`, `?`, or `!`. Finally, for each relation detected in the dataframe, we save it with the entities and their ranges in attachment of the corresponding sentence.

2. **Formatting stage**: For each relation, the preprocessor adds one more sample to `preprocessor.sentences`. The sentence will be formatted based on the format code provided (view our report for more details). If the `run_vncorenlp_wseg` is on, the sentence will continue to be segmented. For the special tokens inserted into the sentence beforehand, we simply substitute the segmented tokens with the original ones.

After executing all, there is also an optional stage of shuffling the model-ready samples.

There are 7 annotation errors (some might not actually be errors but here we use "error" as a general term for all mentioned circumstances) that needs fixing:

1. **Entity as a subword**: Normally, the annotation splits each word into each row by space. However, when a word is followed immediately by a punctuation and that word is a part of an entity, the annotator will duplicate that word into the next row. This sometimes happens with a "word" actually containing multiple words and the entity word is a part of that.
```
# VLSP2020_RE_train\23357000.conll
1-173	807-810	văn	*[8]	ORGANIZATION[8]	AFFILIATION	1-166[7_8]
1-174	811-816	phòng	*[8]	ORGANIZATION[8]	_	_
1-175	817-821	kiến	*[8]	ORGANIZATION[8]	_	_
1-176	822-826	trúc	*[8]	ORGANIZATION[8]	_	_
1-177	827-833	1+1>2;	_	_	_	_
1-177.1	827-832	1+1>2	*[8]	ORGANIZATION[8]	AFFILIATION	1-166[7_8]
```

2. **Not separated by space**: There are some cases that the annotator left two words or more with spaces in between in a row and an entity part is in that chunk (normally at the end or beginning). In order to make the data meaningful, we have to separate that entity part out. However, the difficult part is that we cannot detect which part belongs to an entity. Until the time writing this, we have only examined one file with this error. However, that file is also heavily corrupted with many other problems so we decided to ignore the whole file. Note that there are cases similar in structure but the entity part is duplicated into the next row. Those cases count as the first error.
```
# VLSP2020_RE_train\23352816.conll
1-153	756-763	. VMISS	*	ORGANIZATION	_	_	_
```

3. **Inter-sentence relations**: The task is intra-sentence but there are some relations between entities belong to different sentences. Therefore, we have to get rid of those relations. This is addressed by checking if the index of the pointed entity is in the same sentence with the origin entity.

4. **Relation annotation not only in the first row of the entity**: If an entity is annotated with a relation, that relation should be inserted to the first row of that entity. But there are some cases, such as when getting the entity with punctuation error (see the example in that case), the entity annotation is interrupted and then continue with reinserting the relation.

5. **Relation not linking to the first word of the entity**: A relation should be linked to the first word of the other entity but when the other entity annotation is interrupted by the first error, the relation is linked to the row of the duplicated word with no punctuation. Due to this error, when finding the range of the entities, we have to go up and down the dataframe from the referenced index to check it.
```
# VLSP2020_RE_train\23356574.conll
1-742	3340-3346	chuyên	*[18]	ORGANIZATION[18]	_	_
1-743	3347-3354	nghiệp,	_	_	_	_
1-743.1	3347-3353	nghiệp	*[18]	ORGANIZATION[18]	PART – WHOLE	1-733[17_18]
1-744	3355-3357	Sở	*[19]	ORGANIZATION[19]	AFFILIATION|PART – WHOLE	1-733[17_19]|1-743.1[18_19]
1-745	3358-3363	GD-ĐT	*[19]	ORGANIZATION[19]	_	_
```

6. **Miscellaneous entities and relations**: Some places in a dataframe, the annotation only detect there is some entity or relation but not specify which that is. For convenience, we ignore all miscellaneous entities and relations, as well as relations involving miscellaneous entities.
```
# VLSP2020_RE_train\23366765.conll
1-6	30-35	Apple	*[1]	MISCELLANEOUS[1]	
1-7	36-39	Pay	*[1]	MISCELLANEOUS[1]	
# VLSP2020_RE_train\23351515.conll
1-318	1372-1376	tỉnh	*[19]	LOCATION[19]	PART – WHOLE|LOCATED|PART – WHOLE|*	1-315[18_19]|1-301[15_19]|1-307[16_19]|1-310[17_19]	
1-319	1377-1381	Bình	*[19]	LOCATION[19]	_	_	
1-320	1382-1386	Định	*[19]	LOCATION[19]	_	_	
```

7. **Same name for different entities**: Normally, if there are two or more entities with the same type, there will be an index after the entity type to denote different entities. This error occurs when this practice is broken. Initially, we decided to mark all instances of the same entity in an input sentence. Because of this error, we could only mark the two instances involved in the examined relation. To explain this further, we cannot specify if two words in a sentence is the same entity solely based on its appearance, there would be a lot of problems that could happen and a lot more work to prevent those.
```
# VLSP2020_RE_train\23351316.conll
1-646	2909-2913	Công	_	_	_	_	
1-647	2914-2916	ty	_	_	_	_	
1-648	2917-2920	con	_	_	_	_	
1-649	2921-2926	Waymo	*	ORGANIZATION	_	_	
1-650	2927-2930	của	_	_	_	_	
1-651	2931-2939	Alphabet	*	ORGANIZATION	PART – WHOLE	1-649	
1-652	2940-2943	thì	_	_	_	_	
1-653	2944-2948	đang	_	_	_	_	
```

Regarding this error, there are some cases really hard to handle that we decide to ignore, e.g.:
```
# VLSP2020_RE_train\23351965.conll
1-320	1419-1427	Brussels	*	LOCATION	PART – WHOLE	1-321	
1-321	1428-1431	(Bỉ	*	LOCATION	_	_	
1-322	1432-1434	).	_	_	_	_	
```

In details, our algorithm detects the range of an entity by a contiguous sequence of rows having the same value in the `entity` column. In the case above, semantically, we as humans can detect that `Brussels` and `Bỉ` are different entities. However, we cannot know for sure if there are any cases that are not as clear as this. We also cannot solve it by only detect an entity by going down from the referenced index because of the fifth error. We call this **self-relation** error. We believe this error could be resolved in future work.

In [15]:
import pandas as pd
import numpy as np

"""
format_code:

0 for <entity1> <sep> <entity2> <sep> <sentence>
where all instances of <entity1> and <entity2> are replaced by the corresponding tokens in <sentence>.

1 for open tag and close tag for each entity are inserted into the sentence.
"""
class VlspPreprocessor:
  special_token = {
    # for format code 0
    "SEP": "<sep>",
    "PERSON[1]": "<person1/>",
    "PERSON[2]": "<person2/>",
    "ORGANIZATION[1]": "<organization1/>",
    "ORGANIZATION[2]": "<organization2/>",
    "LOCATION[1]": "<location1/>",
    "LOCATION[2]": "<location2/>",
    
    # for format code 1
    "<PERSON>": "<person>",
    "</PERSON>": "</person>",
    "<ORGANIZATION>": "<organization>",
    "</ORGANIZATION>": "</organization>",
    "<LOCATION>": "<location>",
    "</LOCATION>": "</location>",
  }
  
  eos_punctuation = [".", "?", "!"] # need updating end of sentence punctuations to be more legit
  
  def __init__(self, drop_no_relation_samples: bool=True, format_code: {0, 1}=0, run_vncorenlp_wseg: bool=False):
    self.dataset = []
    
    self.label2id = {}
    self.id2label = {}
    self.drop_no_relation_samples = drop_no_relation_samples
    
    self.unformatted_offset = 0
    self.format_code = format_code
    self.run_vncorenlp_wseg = run_vncorenlp_wseg
    
    self.sentences = np.array([])
    self.labels = np.array([], dtype="uint8")
    
    self.self_relations = [] # for debugging error 7
    
  def __len__(self) -> int:
    return len(self.sentences)
  
  def execute_all(self, src_dir: str, drop_no_relation_samples: bool=None, format_code: {0, 1}=None, run_vncorenlp_wseg: bool=None, shuffle: int=None):
    print(f"Executing {src_dir}...")
    self.load(src_dir, drop_no_relation_samples=drop_no_relation_samples)
    print("Done loading.")
    self.format(format_code=format_code, run_vncorenlp_wseg=run_vncorenlp_wseg)
    print("Done formatting.")
    if shuffle is not None:
      self.shuffle(shuffle)
      print(f"Done shuffling with seed {shuffle}.")
    print("✅ Done all.")
    
  def load(self, src_dir: str, drop_no_relation_samples: bool=None): # this function is designed to execute many times on many directories
    if drop_no_relation_samples is not None:
      self.drop_no_relation_samples = drop_no_relation_samples
    
    for root, _, files in os.walk(src_dir):
      for file in files: # currently the structure is one tsv file per subfolder, but this loop is in case there are more
        if os.path.join("VLSP2020_RE_train", "23352816.conll") in root: # the tsv file in this subfolder is heavily corrupted, just ignore it for now
          continue
        if file.endswith(".tsv"):
          self.root = root # for debugging
          self.process_tsv(os.path.join(root, file))

    if self.drop_no_relation_samples:
      self._drop_no_relation_samples()
    
    self._build_id2label()
  
  def format(self, format_code: {0, 1}=None, run_vncorenlp_wseg: bool=None):
    if format_code is not None:
      self.format_code = format_code
    if run_vncorenlp_wseg is not None:
      self.run_vncorenlp_wseg = run_vncorenlp_wseg
    
    self._format(self.unformatted_offset)
    if run_vncorenlp_wseg:
      self._run_vncorenlp_wseg(self.unformatted_offset)
    self.unformatted_offset = len(self.dataset)

  def process_tsv(self, tsv_dir: str):
    df = pd.read_csv(tsv_dir, sep="\t", comment="#", quotechar="\t", header=None)
    if len(df.columns) < 8: # relation columns are missing
      return
    df.columns = ["ann_idx", "range", "word", "var", "entity", "relation", "rel_heads", "-"] # because of the meaningless \t at the end of each line, we need one more column "-"

    self._handle_word_with_entity_subword(df)

    dataset_offset = len(self.dataset)
    self._extract_sentences(df)
    # self._extract_entities(df, dataset_offset) # because of changes due to error 7, no need to keep track of entities anymore
    self._extract_relations(df, dataset_offset)

  def _handle_word_with_entity_subword(self, df):
    error_indices = df[df["ann_idx"].shift(-1).apply(lambda i: i is not None and ".1" in str(i))]["word"].index
    offset = 0
    for idx in error_indices:
      idx += offset
      entity_word = df.iloc[idx + 1]["word"]
      prefix, suffix = df.iloc[idx]["word"].split(entity_word, 1)

      if idx > 0 and df.iloc[idx - 1]["entity"] == df.iloc[idx + 1]["entity"]:
        df.loc[idx + 1, "relation"] = "_"
        df.loc[idx + 1, "rel_heads"] = "_"

      if suffix != "":
        df.loc[idx + 1.5] = ["_", "_", suffix, "_", "_", "_", "_", np.nan]
        offset += 1

      if prefix != "":
        df.loc[idx, "word"] = prefix
      else:
        df = df.drop(idx)
        offset -= 1

      df = df.sort_index().reset_index(drop=True)

  def _extract_sentences(self, df):
    sentence = []
    for word in df["word"].values:
      word = str(word) #  in VLSP2020_RE_dev/23352623.conll, the word "nan" counts as a float ¯\_(ツ)_/¯
      sentence.append(word)
      if word[-1] in VlspPreprocessor.eos_punctuation:
        self.dataset.append({ "word_list": np.array(sentence) })
        sentence = []
    self.dataset.append({ "word_list": np.array(sentence) })

  def _extract_entities(self, df, dataset_offset: int): # deprecated
    offset = 0
    sample_idx = dataset_offset
    entity_df = df[df["entity"] != "_"]["entity"]
    for entity, idx in zip(entity_df.values, entity_df.index):
      while idx >= offset + len(self.dataset[sample_idx]["word_list"]):
        offset += len(self.dataset[sample_idx]["word_list"])
        sample_idx += 1

      if "entities" not in self.dataset[sample_idx]:
        self.dataset[sample_idx]["entities"] = {} # set of entity variables

      if entity not in self.dataset[sample_idx]["entities"]\
        and "MISCELLANEOUS" not in entity: # error 6, ignore miscellaneous entities
        self.dataset[sample_idx]["entities"].update(entity)

  def _extract_relations(self, df, dataset_offset: int):
    offset = 0
    sample_idx = dataset_offset
    relation_df = df[df["relation"] != "_"][["relation", "entity", "rel_heads"]]    
    for (relations, entity, rel_heads), idx in zip(relation_df.values, relation_df.index):
      if "MISCELLANEOUS" in entity: # error 6, ignore miscellaneous entity
        continue
      
      relations = relations.split("|")
      rel_heads = rel_heads.split("|")
        
      while idx >= offset + len(self.dataset[sample_idx]["word_list"]):
        offset += len(self.dataset[sample_idx]["word_list"])
        sample_idx += 1
      sample = self.dataset[sample_idx]
      entity_range = self._find_range(df, idx, offset=offset)
      entity = entity.split("[")[0]
      
      for i in range(len(relations)):
        if relations[i] == "*": # error 6, ignore miscellaneous relations in VLSP2020_RE_train\23351515.conll and 23351856.conll
          continue

        other_entity_df = df[df["ann_idx"] == rel_heads[i].split("[")[0]]["entity"]
        other_entity = other_entity_df.values[0].split("[")[0]
        other_entity_idx = other_entity_df.index[0]
        
        if "MISCELLANEOUS" in other_entity: # error 6, ignore miscellaneous entity
          continue
        if offset > other_entity_idx or other_entity_idx >= offset + len(sample["word_list"]): # error 3, ignore inter-sentence relations
          continue
        
        if "relations" not in sample:
          sample["relations"] = []
        
        other_entity_range = self._find_range(df, other_entity_idx, offset=offset)
        
        if entity_range[0] == other_entity_range[0] and  entity_range[1] == other_entity_range[1]: # error 7, ignore and save to debug log
          self.self_relations.append((relations[i], rel_heads[i], self.root))
          continue
        
        assert entity_range[1] <= other_entity_range[0] or other_entity_range[1] <= entity_range[0]
        sample["relations"].append((entity, entity_range, other_entity, other_entity_range, relations[i]))

        if relations[i] not in self.label2id: # update label2id mapping
          self.label2id[relations[i]] = len(self.label2id)
          
  def _find_range(self, df: pd.DataFrame, idx: int, offset: int=0) -> tuple[int, int]:
    entity = df.iloc[int(idx)]["entity"]
    x = int(idx)
    while x > 0 and df.iloc[x - 1]["entity"] == entity:
      x -= 1
    y = int(idx) + 1
    while y < df.shape[0] and df.iloc[y]["entity"] == entity:
      y += 1
    assert x - offset < y - offset
    return (x - offset, y - offset)

  def _drop_no_relation_samples(self, dataset_offset: int=0):
    clean_dataset = []
    for sample in self.dataset[dataset_offset:]:
      if "relations" in sample:
        clean_dataset.append(sample)
    self.dataset = [*self.dataset[:dataset_offset], *clean_dataset]
  
  def _build_id2label(self):
    self.id2label = {v: k for k, v in self.label2id.items()}
  
  def _format(self, dataset_offset: int):
    for sample in self.dataset[dataset_offset:]:
      if "relations" not in sample:
        continue
      
      for ent1, ent1_range, ent2, ent2_range, rel_type in sample["relations"]:
        self.labels = np.append(self.labels, self.label2id[rel_type])
        sentence = np.array([])
        
        if self.format_code == 0:
          ent2 = f"{ent2}[{2 if ent1 == ent2 else 1}]"
          ent1 = f"{ent1}[1]"
          sentence = np.append(sentence, [
            VlspPreprocessor.special_token[ent1],
            VlspPreprocessor.special_token["SEP"],
            VlspPreprocessor.special_token[ent2],
            VlspPreprocessor.special_token["SEP"],
          ])
        
        if ent1_range[0] >= ent2_range[1]:
          ent1, ent1_range, ent2, ent2_range = ent2, ent2_range, ent1, ent1_range
        assert ent1_range[1] <= ent2_range[0]
        
        if self.format_code == 0:
          sentence = np.concatenate([
            sentence,
            sample["word_list"][:ent1_range[0]],
            [VlspPreprocessor.special_token[ent1]],
            sample["word_list"][ent1_range[1]:ent2_range[0]],
            [VlspPreprocessor.special_token[ent2]],
            sample["word_list"][ent2_range[1]:]
          ])
        elif self.format_code == 1:
          sentence = np.concatenate([
            sample["word_list"][:ent1_range[0]],
            [VlspPreprocessor.special_token[f"<{ent1}>"]],
            sample["word_list"][ent1_range[0]:ent1_range[1]],
            [VlspPreprocessor.special_token[f"</{ent1}>"]],
            sample["word_list"][ent1_range[1]:ent2_range[0]],
            [VlspPreprocessor.special_token[f"<{ent2}>"]],
            sample["word_list"][ent2_range[0]:ent2_range[1]],
            [VlspPreprocessor.special_token[f"</{ent2}>"]],
            sample["word_list"][ent2_range[1]:]
          ])
        else:
          raise ValueError("Preprocessor is using some non-predefined format code.")
        
        self.sentences = np.append(self.sentences, " ".join(sentence))
        
    assert len(self.sentences) == len(self.labels)

  def _run_vncorenlp_wseg(self, dataset_offset: int=0):
    def sentence_transform(s: str):
      s = vncorenlp_model.word_segment(s)[0]
      for _, token in VlspPreprocessor.special_token.items():
        wseg_token = vncorenlp_model.word_segment(token)[0]
        s = s.replace(wseg_token, token)
      return s

    array_transform = np.vectorize(sentence_transform)
    self.sentences = np.concatenate([
      self.sentences[:dataset_offset],
      array_transform(self.sentences[dataset_offset:])
    ])
  
  def shuffle(self, seed: int=42):
    np.random.seed(seed)
    mask = np.random.permutation(len(self.sentences))
    assert len(self.sentences) == len(self.labels)
    self.sentences = self.sentences[mask]
    self.labels = self.labels[mask]
  
  def clear(self):
    self.dataset.clear()
    self.label2id.clear()
    self.id2label.clear()
    self.reset_format()
    
  def reset_format(self):
    self.unformatted_offset = 0
    self.sentences = np.array([])
    self.labels = np.array([], dtype="uint8")
    

## Execution

The original test set from VLSP does not provide relation labels. Therefore, we decided to use the given dev set as the test set and split the given train set into new train and dev sets. The seed used for shuffling is `42` in all cases.

We prepare a procedure to save data in a `jsonl` file.

In [16]:
import jsonlines

def save_data(dir, sentences, labels):
  with jsonlines.open(dir, mode="w") as writer:
    writer.write_all([{ "sentence": x, "label": int(y) } for x, y in zip(sentences, labels)])

Load all documents in the provided training and development folder:

In [17]:
preprocessor = VlspPreprocessor()
preprocessor.load(os.path.join("VLSP2020", "VLSP2020_RE_train"))
preprocessor.load(os.path.join("VLSP2020", "VLSP2020_RE_dev"))

Run and save datasets without VnCoreNLP word segmentation:

In [18]:
preprocessor.format(format_code=0, run_vncorenlp_wseg=False)
preprocessor.shuffle(42)

n = len(preprocessor.sentences)
train_set_format0 = preprocessor.sentences[:int(.8 * n)]
dev_set_format0 = preprocessor.sentences[int(.8 * n):int(.9 * n)]
test_set_format0 = preprocessor.sentences[int(.9 * n):]

train_labels = preprocessor.labels[:int(.8 * n)]
dev_labels = preprocessor.labels[int(.8 * n):int(.9 * n)]
test_labels = preprocessor.labels[int(.9 * n):]

save_data(
  os.path.join("vlsp_preprocessed", "format0_nowseg", "train.jsonl"),
  train_set_format0, train_labels
)
save_data(
  os.path.join("vlsp_preprocessed", "format0_nowseg", "dev.jsonl"),
  dev_set_format0, dev_labels
)
save_data(
  os.path.join("vlsp_preprocessed", "format0_nowseg", "test.jsonl"),
  test_set_format0, test_labels
)

In [19]:
preprocessor.reset_format()
preprocessor.format(format_code=1, run_vncorenlp_wseg=False)
preprocessor.shuffle(42)

train_set_format1 = preprocessor.sentences[:int(.8 * n)]
dev_set_format1 = preprocessor.sentences[int(.8 * n):int(.9 * n)]
test_set_format1 = preprocessor.sentences[int(.9 * n):]

save_data(
  os.path.join("vlsp_preprocessed", "format1_nowseg", "train.jsonl"),
  train_set_format1, train_labels
)
save_data(
  os.path.join("vlsp_preprocessed", "format1_nowseg", "dev.jsonl"),
  dev_set_format1, dev_labels
)
save_data(
  os.path.join("vlsp_preprocessed", "format1_nowseg", "test.jsonl"),
  test_set_format1, test_labels
)

Run and save datasets with VnCoreNLP word segmentation enabled:

In [20]:
preprocessor.reset_format()
preprocessor.format(format_code=0, run_vncorenlp_wseg=True)
preprocessor.shuffle(42)

train_set_format0 = preprocessor.sentences[:int(.8 * n)]
dev_set_format0 = preprocessor.sentences[int(.8 * n):int(.9 * n)]
test_set_format0 = preprocessor.sentences[int(.9 * n):]

save_data(
  os.path.join("vlsp_preprocessed", "format0", "train.jsonl"),
  train_set_format0, train_labels
)
save_data(
  os.path.join("vlsp_preprocessed", "format0", "dev.jsonl"),
  dev_set_format0, dev_labels
)
save_data(
  os.path.join("vlsp_preprocessed", "format0", "test.jsonl"),
  test_set_format0, test_labels
)

In [21]:
preprocessor.reset_format()
preprocessor.format(format_code=1, run_vncorenlp_wseg=True)
preprocessor.shuffle(42)

train_set_format1 = preprocessor.sentences[:int(.8 * n)]
dev_set_format1 = preprocessor.sentences[int(.8 * n):int(.9 * n)]
test_set_format1 = preprocessor.sentences[int(.9 * n):]

save_data(
  os.path.join("vlsp_preprocessed", "format1", "train.jsonl"),
  train_set_format1, train_labels
)
save_data(
  os.path.join("vlsp_preprocessed", "format1", "dev.jsonl"),
  dev_set_format1, dev_labels
)
save_data(
  os.path.join("vlsp_preprocessed", "format1", "test.jsonl"),
  test_set_format1, test_labels
)

KeyboardInterrupt: 

## Results

The length of each dataset:

In [22]:
print(f"Number of training samples: {len(train_labels)}")
print(f"Number of development samples: {len(dev_labels)}")
print(f"Number of testing samples: {len(test_labels)}")

Number of training samples: 3228
Number of development samples: 404
Number of testing samples: 404


Some of the samples in format 0:

In [23]:
dev_set_format0[:10]

array(['<location1/> <sep> <person1/> <sep> Tổng_Lãnh_sự <location1/> tại Việt_Nam - bà <person1/> - cho biết , bà rất thích ăn trái thanh_long Việt_Nam và bà vui_mừng khi gia_đình , bạn_bè bà ở Úc rồi_đây cũng được thưởng_thức loại trái_cây tuyệt_vời này của Việt_Nam .',
       '<location1/> <sep> <location2/> <sep> Nạn_nhân trong vụ án này là bà Phạm_Thị_Ngọc_Diệp ( SN 1978 , trú <location2/> , <location1/> ) , là giáo_viên Trường THCS Chu_Văn_An , thị_trấn Chư_Sê , đồng_thời là vợ của ông Giáp_Bá_Dự - Chánh toà hình_sự TAND tỉnh Gia_Lai .',
       '<location1/> <sep> <location2/> <sep> Khi đó , trên đường có xe đầu kéo 77 C-143 . 82 kéo sơ_mi rơ moóc 77 R - 022.74 do anh Hồ_Thanh_Lợi ( 38 tuổi , ngụ tổ 8 , <location2/> , <location1/> , tỉnh Bình_Định ) điều_khiển đang đi theo hướng Bắc-Nam.',
       '<location1/> <sep> <person1/> <sep> <person1/> là nữ ca_sĩ trẻ triển_vọng của làng âm_nhạc <location1/> .',
       '<organization1/> <sep> <person1/> <sep> Còn chị <person1/> là công_nh

Some of the samples in format 1:

In [24]:
dev_set_format1[:10]

array(['Tổng Lãnh sự <location> Úc </location> tại Việt Nam - bà <person> Karen Lanyon </person> - cho biết, bà rất thích ăn trái thanh long Việt Nam và bà vui mừng khi gia đình, bạn bè bà ở Úc rồi đây cũng được thưởng thức loại trái cây tuyệt vời này của Việt Nam .',
       'Nạn nhân trong vụ án này là bà Phạm Thị Ngọc Diệp (SN 1978, trú <location> phường Tây Sơn </location> , <location> TP Pleiku </location> ), là giáo viên Trường THCS Chu Văn An , thị trấn Chư Sê , đồng thời là vợ của ông Giáp Bá Dự -Chánh tòa hình sự TAND tỉnh Gia Lai .',
       'Khi đó, trên đường có xe đầu kéo 77C-143.82 kéo sơ mi rơ moóc 77R- 022.74 do anh Hồ Thanh Lợi (38 tuổi, ngụ tổ 8 , <location> phường Bùi Thị Xuân </location> , <location> TP.Quy Nhơn </location> , tỉnh Bình Định ) điều khiển đang đi theo hướng Bắc-Nam.',
       '<person> Võ Kiều Vân </person> là nữ ca sĩ trẻ triển vọng của làng âm nhạc <location> Việt Nam </location> .',
       'Còn chị <person> Tuyết </person> là công nhân của <organizati

Statistics by labels in train set:

In [25]:
print(preprocessor.label2id)
df = pd.DataFrame(pd.Series(train_labels).value_counts(), columns=["# samples"])
df["Ratio (%)"] = df["# samples"] / df["# samples"].sum() * 100
df

{'AFFILIATION': 0, 'LOCATED': 1, 'PART – WHOLE': 2, 'PERSONAL - SOCIAL': 3}


Unnamed: 0,# samples,Ratio (%)
2,1335,41.356877
0,987,30.576208
1,754,23.358116
3,152,4.708798


Statistics by labels in dev set:

In [None]:
print(preprocessor.label2id)
df = pd.DataFrame(pd.Series(dev_labels).value_counts(), columns=["# samples"])
df["Ratio (%)"] = df["# samples"] / df["# samples"].sum() * 100
df

{'AFFILIATION': 0, 'LOCATED': 1, 'PART – WHOLE': 2, 'PERSONAL - SOCIAL': 3}


Unnamed: 0,# samples,Ratio (%)
2,154,38.118812
0,135,33.415842
1,82,20.29703
3,33,8.168317


Statistics by labels in test set:

In [None]:
print(preprocessor.label2id)
df = pd.DataFrame(pd.Series(test_labels).value_counts(), columns=["# samples"])
df["Ratio (%)"] = df["# samples"] / df["# samples"].sum() * 100
df

{'AFFILIATION': 0, 'LOCATED': 1, 'PART – WHOLE': 2, 'PERSONAL - SOCIAL': 3}


Unnamed: 0,# samples,Ratio (%)
2,173,42.821782
0,128,31.683168
1,92,22.772277
3,11,2.722772


All self-relation errors the preprocessor detected:

In [14]:
preprocessor.self_relations

[('PART – WHOLE', '1-321', 'VLSP2020\\VLSP2020_RE_train\\23351965.conll'),
 ('PART – WHOLE', '1-161', 'VLSP2020\\VLSP2020_RE_train\\23352701.conll'),
 ('PART – WHOLE', '1-885', 'VLSP2020\\VLSP2020_RE_train\\23352753.conll'),
 ('PART – WHOLE', '1-41', 'VLSP2020\\VLSP2020_RE_train\\23353786.conll'),
 ('PART – WHOLE', '1-390', 'VLSP2020\\VLSP2020_RE_train\\23353891.conll'),
 ('PART – WHOLE', '1-350', 'VLSP2020\\VLSP2020_RE_train\\23354619.conll'),
 ('PART – WHOLE', '1-423', 'VLSP2020\\VLSP2020_RE_train\\23354619.conll'),
 ('PART – WHOLE', '1-1375', 'VLSP2020\\VLSP2020_RE_train\\23354880.conll'),
 ('PART – WHOLE', '1-151', 'VLSP2020\\VLSP2020_RE_dev\\23352337.conll'),
 ('PART – WHOLE', '1-393', 'VLSP2020\\VLSP2020_RE_dev\\23352396.conll'),
 ('PART – WHOLE', '1-197', 'VLSP2020\\VLSP2020_RE_dev\\23352445.conll'),
 ('PART – WHOLE', '1-741', 'VLSP2020\\VLSP2020_RE_dev\\23352491.conll'),
 ('AFFILIATION', '1-208', 'VLSP2020\\VLSP2020_RE_dev\\23352585.conll')]