# Tracking works cited in the _Dai Nihon Shi_
We have a selection of chapters from the _Dai Nihon Shi_ that have been manually annotated with named entities, including people (`PER`), locations (`LOC`), and works of art (`WORK_OF_ART`). Using this selection, we want to identify some of the most frequently mentioned works of art (nearly always written works) and track their appearance across the entire _Dai Nihon Shi_.

First, let's check the works of art (tagged `WORK_OF_ART`) in our annotated data, which is stored in the CoNLL-2002 (`.conll`) format.

In [1]:
from pathlib import Path

import spacy

# load all annotated data exported from INCEpTION; make a Doc out of each chapter
docs = []
nlp = spacy.blank("lzh")
for conll_file in Path("../assets/kanbun/3_inception_export/").glob("*.conll"):
  with open(conll_file, "r", encoding="utf-8") as f:
    conll_doc = f.read().strip()
    words = []
    sent_starts = []
    pos_tags = []
    biluo_tags = []
    for conll_sent in conll_doc.split("\n\n"):
        conll_sent = conll_sent.strip()
        if not conll_sent:
            continue
        lines = [line.strip() for line in conll_sent.split("\n") if line.strip()]
        cols = list(zip(*[line.split() for line in lines]))
        length = len(cols[0])
        words.extend(cols[0])
        sent_starts.extend([True] + [False] * (length - 1))
        biluo_tags.extend(spacy.training.iob_utils.iob_to_biluo(cols[-1]))
        pos_tags.extend(cols[1] if len(cols) > 2 else ["-"] * length)
    doc = spacy.tokens.Doc(
      nlp.vocab,
      words=words,
      spaces=[False] * len(words),
      user_data={"title": conll_file.stem},
    )
    for i, token in enumerate(doc):
      token.tag_ = pos_tags[i]
      token.is_sent_start = sent_starts[i]
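    # tags_to_entities returns (label, start, end) with an inclusive end index, hence end=e + 1
    entities = spacy.training.iob_utils.tags_to_entities(biluo_tags)
    doc.ents = [spacy.tokens.Span(doc, start=s, end=e + 1, label=label) for label, s, e in entities]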
    docs.append(doc)

# check some of the annotated entities in a doc
spacy.displacy.render(docs[0][:100], style="ent")
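
To make the parsing step concrete, here is a minimal, spaCy-free sketch of how token-per-line IOB tags turn into entity spans. The sample sentence, its tags, and the helper `iob_to_spans` are invented for illustration and are not part of the annotated data; the cell above relies on spaCy's own `iob_to_biluo` and `tags_to_entities` instead. Note that this sketch uses exclusive end indices, whereas `tags_to_entities` returns inclusive ones, which is why the cell above adds 1.

```python
def iob_to_spans(tags):
    """Group IOB tags into (label, start, end) spans; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # an open span ends at O, at a new B-, or at an I- with a different label
        if start is not None and (not tag.startswith("I-") or tag[2:] != label):
            spans.append((label, start, i))
            start, label = None, None
        # B- opens a span; a stray I- with no open span is treated as a start too
        if tag != "O" and start is None:
            start, label = i, tag[2:]
    if start is not None:
        spans.append((label, start, len(tags)))
    return spans


# hypothetical sample in a two-column "token tag" layout
sample = """據 O
日 B-WORK_OF_ART
本 I-WORK_OF_ART
紀 I-WORK_OF_ART
、 O
古 B-WORK_OF_ART
事 I-WORK_OF_ART
記 I-WORK_OF_ART"""

words, tags = zip(*(line.split() for line in sample.split("\n")))
entities = [("".join(words[s:e]), label) for label, s, e in iob_to_spans(tags)]
print(entities)
```

Each token is a single character here, so joining the tokens of a span without spaces recovers the title, mirroring the `spaces=[False] * len(words)` setting used when building each `Doc`.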



Let's sort the works of art by their frequency of appearance in the annotated data.

In [2]:
import collections

# get all the tagged works of art; count how many times each occurs
works_of_art = [ent for doc in docs for ent in doc.ents if ent.label_ == "WORK_OF_ART"]
top_works = dict(collections.Counter(ent.text for ent in works_of_art).most_common())

print("\n".join(f"{work}: {count}" for work, count in top_works.items()))

公卿補任: 1144
尊卑分脈: 1030
東鑑: 803
源平盛衰記: 737
一代要記: 723
日本紀略: 614
太平記: 573
皇胤紹運錄: 543
日本紀: 542
平家物語: 436
女院小傳: 379
續日本紀: 371
三代實錄: 356
歷代皇紀: 283
扶桑略記: 268
本書: 265
榮華物語: 262
增鏡: 240
今鏡: 223
玉海: 217
大鏡: 202
帝王編年記: 193
愚管抄: 180
百鍊抄: 173
文德實錄: 173
續日本後紀: 152
今昔物語: 144
皇胤系圖: 140
東國通鑑: 133
十訓抄: 130
古事談: 129
大鏡裏書: 124
華物語: 120
系圖: 113
古事記: 113
後宮略傳: 112
中右記: 112
平治物語: 102
古今著聞集: 100
園太曆: 97
保元物語: 95
小右記: 91
長門本平家物語: 90
類聚國史: 88
梅松論: 83
三國史記: 82
平氏系圖: 79
百鍊鈔: 77
保曆間記: 76
承久記: 76
陸奧話記: 76
日本後紀: 74
台記: 72
姓氏錄: 68
貴女鈔: 64
本朝文粹: 60
山槐記: 59
明月記: 51
齋院記: 50
關東評定傳: 48
結城文書: 46
諸門跡譜: 45
天正本太平記: 45
貴女抄: 43
政事要略: 43
仁和寺書籍目錄: 43
神皇正統記: 42
袋草子: 42
江談鈔: 41
續古事談: 41
新葉和歌集: 39
五代帝王物語: 38
盛衰記: 37
本書註: 36
花營三代記: 36
舊事紀: 35
愚管鈔: 35
唐書: 35
江談抄: 33
後紀: 32
足利家傳: 32
金勝院本: 30
元亨釋書: 30
朝野群載: 29
歌仙傳: 29
喜連川系圖: 28
菅家文草: 27
萬葉集: 26
記: 26
園太曆系圖: 26
紀: 26
建武二年記: 25
將軍執權次第: 25
八雲御抄: 24
水鏡: 24
紹運錄: 23
吉記: 23
八雲御鈔: 23
作者部類: 22
續往生傳: 22
後愚昧記: 21
元弘日記裏書: 21
左經記: 21
將門記: 21
懷風藻: 20
外記日記: 20
阿蘇社文書: 19
常樂記: 19
八坂本平家物語: 19
善鄰國寶記: 19
宇佐
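
The list above counts some titles under more than one spelling, for example 百鍊抄 and 百鍊鈔, or 愚管抄 and 愚管鈔. If we assume these pairs refer to the same work, the counts could be merged before ranking. The sketch below is one possible approach, not part of the original pipeline: the `VARIANTS` table is hypothetical, and a blanket character substitution like this could wrongly merge distinct titles, so any real mapping should be reviewed by hand.

```python
import collections

# counts copied from the output above
counts = collections.Counter(
    {"百鍊抄": 173, "百鍊鈔": 77, "愚管抄": 180, "愚管鈔": 35, "大鏡": 202}
)

# hypothetical normalization table: variant character -> preferred character
VARIANTS = str.maketrans({"鈔": "抄"})

# add each title's count to its normalized form
merged = collections.Counter()
for title, count in counts.items():
    merged[title.translate(VARIANTS)] += count

for title, count in merged.most_common():
    print(f"{title}: {count}")
```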

Now, let's track the citations of these works of art in each chapter of our selection and write the output to JSON so that it can be visualized.

In [15]:
import itertools
import json
import csv

# get the most cited works per chapter (only entities tagged WORK_OF_ART)
works_by_doc = {
  doc.user_data["title"]: dict(
    collections.Counter(ent.text for ent in doc.ents if ent.label_ == "WORK_OF_ART").most_common()
  )
  for doc in docs
}

list_works_by_doc = sorted([(doc.replace("text", ""), works_by_doc[doc]) for doc in works_by_doc], key=lambda x: int(x[0]))

docs_by_work = {
  work: { doc: counts.get(work, 0) for doc, counts in list_works_by_doc } for work in top_works
}

# get all unique chapter and work titles
doc_titles = set([doc.user_data["title"] for doc in docs])
work_titles = set([work for work_list in works_by_doc.values() for work in work_list])

# create a list of matrix cells where each cell is the number of citations of a work in a chapter
output = [
  {
    "doc_title": doc.replace("text", "").replace("part", "."),
    "work_title": work,
    "count": works_by_doc[doc][work]
  }
 for doc, work in itertools.product(doc_titles, work_titles) if work in works_by_doc[doc]
]

# sort by chapter order
output.sort(key=lambda row: float(row["doc_title"]))

# write output to json for visualization
with open("cited_works_by_chapter.json", mode="w", encoding="utf-8") as f:
  json.dump(output, f, ensure_ascii=False)
print("wrote cited_works_by_chapter.json")

# write output to csv, including totals
# columns are doc titles since there are fewer of them than works cited
# also add totals for each work
# remove the "text" prefix from the doc titles and sort by chapter order
with open("cited_works_by_chapter.csv", mode="w", encoding="utf-8-sig", newline="") as f:
  fmt_titles = [str(t) for t in sorted([int(title.replace("text", "")) for title in doc_titles])]
  writer = csv.DictWriter(f, fieldnames=("work", "total", *fmt_titles))
  writer.writeheader()
  for work, doc_list in docs_by_work.items():
    writer.writerow({"work": work, "total": top_works[work], **doc_list})
print("wrote cited_works_by_chapter.csv")

wrote cited_works_by_chapter.json
wrote cited_works_by_chapter.csv
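
The two files hold the same counts in different shapes: the JSON is a "long" list with one record per chapter and work, while the CSV is a "wide" table with one row per work and one column per chapter. As a rough illustration of how one maps to the other, the sketch below pivots a few records into wide rows. The records are invented toy values that only mimic the structure of `cited_works_by_chapter.json`; they are not real counts.

```python
import collections

# toy records shaped like the rows in cited_works_by_chapter.json; values are invented
records = [
    {"doc_title": "1", "work_title": "日本紀", "count": 12},
    {"doc_title": "1", "work_title": "古事記", "count": 3},
    {"doc_title": "2", "work_title": "日本紀", "count": 7},
]

# chapter columns in numeric order
chapters = sorted({r["doc_title"] for r in records}, key=float)

# pivot: one row per work, one column per chapter, 0 where a work is not cited
table = collections.defaultdict(lambda: dict.fromkeys(chapters, 0))
for r in records:
    table[r["work_title"]][r["doc_title"]] = r["count"]

# add a total per work and sort the most cited works first
rows = [{"work": w, "total": sum(c.values()), **c} for w, c in table.items()]
rows.sort(key=lambda row: row["total"], reverse=True)
for row in rows:
    print(row)
```

Filling missing chapter and work pairs with 0 keeps every row the same width, which is what `csv.DictWriter` expects when each chapter is a column.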
