# Tracking works cited in the _Dai Nihon Shi_
We have a selection of chapters from the _Dai Nihon Shi_ that have been manually annotated with named entities, including people (`PER`), locations (`LOC`), and works of art (`WORK_OF_ART`). Using this selection, we want to identify some of the most frequently mentioned works of art (nearly always written works) and track their appearance across the entire _Dai Nihon Shi_.

First, let's check the works of art (tagged `WORK_OF_ART`) in our annotated data, which is stored in the CoNLL-2002 (`.conll`) format.

In [5]:
from pathlib import Path

import spacy

# load all annotated data exported from INCEpTION; make a Doc out of each chapter
docs = []
nlp = spacy.blank("lzh")
for conll_file in Path("../assets/kanbun/3_inception_export/").glob("*.conll"):
  with open(conll_file, "r", encoding="utf-8") as f:
    conll_doc = f.read().strip()
    words = []
    sent_starts = []
    pos_tags = []
    biluo_tags = []
    for conll_sent in conll_doc.split("\n\n"):
        conll_sent = conll_sent.strip()
        if not conll_sent:
            continue
        lines = [line.strip() for line in conll_sent.split("\n") if line.strip()]
        cols = list(zip(*[line.split() for line in lines]))
        length = len(cols[0])
        words.extend(cols[0])
        sent_starts.extend([True] + [False] * (length - 1))
        biluo_tags.extend(spacy.training.iob_utils.iob_to_biluo(cols[-1]))
        pos_tags.extend(cols[1] if len(cols) > 2 else ["-"] * length)
    doc = spacy.tokens.Doc(
      nlp.vocab,
      words=words,
      spaces=[False] * len(words),
      user_data={"title": conll_file.stem},
    )
    for i, token in enumerate(doc):
      token.tag_ = pos_tags[i]
      token.is_sent_start = sent_starts[i]
    entities = spacy.training.iob_utils.tags_to_entities(biluo_tags)
    doc.ents = [spacy.tokens.Span(doc, start=s, end=e + 1, label=L) for L, s, e in entities]
    docs.append(doc)

# check some of the annotated entities in a doc
spacy.displacy.render(docs[0][:100], style="ent")
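
For readers unfamiliar with the format, here is a minimal pure-Python sketch of how IOB tags in a CoNLL-style file map to entity spans. The sample sentence, its two-column layout, and the one-character-per-token segmentation are assumptions made for illustration, not an excerpt from the real export. `iob_spans` is a hypothetical helper, not the spaCy function used above.

```python
# Hypothetical sentence: one token per line, token first and IOB tag last.
# Character-level tokens are an assumption for this example.
sample = """\
據 O
公 B-WORK_OF_ART
卿 I-WORK_OF_ART
補 I-WORK_OF_ART
任 I-WORK_OF_ART
東 B-WORK_OF_ART
鑑 I-WORK_OF_ART
"""


def iob_spans(rows):
    """Return (label, start, end_exclusive) for each entity in (token, tag) rows.

    Stray I- tags that do not continue an open entity are ignored.
    """
    spans = []
    start = label = None
    for i, (_, tag) in enumerate(rows):
        continues = tag.startswith("I-") and label == tag[2:]
        # close the open entity when the current tag does not continue it
        if not continues and label is not None:
            spans.append((label, start, i))
            start = label = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    # close an entity that runs to the end of the sentence
    if label is not None:
        spans.append((label, start, len(rows)))
    return spans


rows = [tuple(line.split()) for line in sample.splitlines() if line.strip()]
tokens = [token for token, _ in rows]
entities = [("".join(tokens[s:e]), label) for label, s, e in iob_spans(rows)]
print(entities)
```

Note that the end index here is exclusive, whereas the loader above adds 1 to the end index it gets from spaCy before building each `Span`.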



Let's identify the 25 most frequently cited works of art from our entire annotated selection.

In [26]:
import collections

# get all the tagged works of art; count how many times each occurs
works_of_art = [ent for doc in docs for ent in doc.ents if ent.label_ == "WORK_OF_ART"]
top_works = dict(collections.Counter(ent.text for ent in works_of_art).most_common(25))

print("\n".join(f"{work}: {count}" for work, count in top_works.items()))

公卿補任: 1144
尊卑分脈: 1030
東鑑: 803
源平盛衰記: 737
一代要記: 723
日本紀略: 614
太平記: 573
皇胤紹運錄: 543
日本紀: 542
平家物語: 436
女院小傳: 379
續日本紀: 371
三代實錄: 356
歷代皇紀: 283
扶桑略記: 268
本書: 265
榮華物語: 262
增鏡: 240
今鏡: 223
玉海: 217
大鏡: 202
帝王編年記: 193
愚管抄: 180
百鍊抄: 173
文德實錄: 173


Now, let's track the citations of these works of art in each chapter of our selection and write the output to JSON and CSV so that it can be visualized.

In [27]:
import itertools
import json
import csv

# for each chapter, count the citations of each of the top 25 works
works_by_doc = {
  doc.user_data["title"]: dict(collections.Counter([ent.text for ent in doc.ents if ent.text in top_works]).most_common())
  for doc in docs
}

# get all unique chapter and work titles
doc_titles = set([doc.user_data["title"] for doc in docs])
work_titles = set([work for work_list in works_by_doc.values() for work in work_list])

# create a list of matrix cells where each cell is the number of citations of a work in a chapter
output = [
  {
    "doc_title": doc.replace("text", "").replace("part", "."),
    "work_title": work,
    "count": works_by_doc[doc][work]
  }
 for doc, work in itertools.product(doc_titles, work_titles) if work in works_by_doc[doc]
]

# sort by chapter order
output.sort(key=lambda row: float(row["doc_title"]))

# write output to json for visualization
with open("output.json", mode="w", encoding="utf-8") as f:
  json.dump(output, f, ensure_ascii=False)

# write output to csv
with open("output.csv", mode="w", newline="", encoding="utf-8") as f:
  writer = csv.DictWriter(f, fieldnames=("doc", *work_titles))
  writer.writeheader()
  for doc, work_list in works_by_doc.items():
    writer.writerow({"doc": doc.replace("text", "").replace("part", "."), **work_list})
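
A visualization such as a heatmap usually needs the long-format rows pivoted into a chapter-by-work matrix. The sketch below shows one way to do that, assuming records shaped like the rows written to `output.json` and chapter titles of the form `"N"` or `"N.M"`. The sample records and the `chapter_key` helper are hypothetical. The helper compares chapter and part as integers, which only matters if a chapter has ten or more parts: `float("2.10")` equals `2.1` and would sort before `2.2`.

```python
# Hypothetical records in the same shape as the rows written to output.json.
records = [
    {"doc_title": "2.10", "work_title": "東鑑", "count": 3},
    {"doc_title": "2.2", "work_title": "東鑑", "count": 5},
    {"doc_title": "2.2", "work_title": "太平記", "count": 1},
]


def chapter_key(title):
    """Sort key comparing chapter and part as integers, e.g. "2.10" -> (2, 10)."""
    chapter, _, part = title.partition(".")
    return (int(chapter), int(part or 0))


# row and column labels in a stable order
chapters = sorted({r["doc_title"] for r in records}, key=chapter_key)
works = sorted({r["work_title"] for r in records})

# look up each (chapter, work) cell, filling missing pairs with 0
counts = {(r["doc_title"], r["work_title"]): r["count"] for r in records}
matrix = [[counts.get((c, w), 0) for w in works] for c in chapters]

for chapter, row in zip(chapters, matrix):
    print(chapter, row)
```

Each row of `matrix` corresponds to one chapter and each column to one work, in the order given by `chapters` and `works`.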