## Character Relationships Workflow

#### 1. Split into chapters

First, the book is split into chapters, because analysing the entire book at once requires too much memory.

In our case, chapters are delimited by 6 consecutive newline characters (`\n`), and the first line of each block of text is taken as the chapter title. The splitting may not be perfect, so some manual adjustments may be necessary.

In [None]:
import os

from chapter_splitter import split_chapters

text_dir = 'text'
book = 'worm'

split_chapters(text_dir, book)
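
For illustration, the splitting rule described above can be sketched as follows. This is not the code in `chapter_splitter`; the function name `split_text_into_chapters` is hypothetical, and treating runs of more than 6 newlines as a delimiter too is an assumption.

```python
import re

# Assumption: a chapter break is a run of six or more consecutive newlines.
CHAPTER_DELIMITER = re.compile(r"\n{6,}")


def split_text_into_chapters(book_text):
    """Return a list of (title, body) tuples, one per chapter block."""
    chapters = []
    for block in CHAPTER_DELIMITER.split(book_text):
        block = block.strip("\n")
        if not block.strip():
            continue  # skip empty blocks at the start or end of the book
        # The first line of each block is taken as the chapter title.
        title, _, body = block.partition("\n")
        chapters.append((title.strip(), body.strip()))
    return chapters
```

Blocks that were split in the wrong place (for example, a scene break that happens to use many blank lines) would still need to be merged by hand.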

#### 2. Conduct Named Entity Recognition and Coreference using BookNLP 

The BookNLP library was used to conduct Named Entity Recognition and coreference resolution (linking each mention of a character, including pronouns, back to that character so that their presence can be tracked).

To install BookNLP, run the following from the command line:
```bash
pip install booknlp
python -m spacy download en_core_web_sm
```

In [None]:
from ner_coref import run_ner_coref

ner_coref_data_dir = os.path.join('booknlp_output', book)
text_dir = os.path.join('text', book)

run_ner_coref(text_dir, ner_coref_data_dir)
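
As a rough idea of what running BookNLP per chapter could look like, here is a minimal sketch. It is not the code in `ner_coref`. The helper names, the assumption that each chapter is a `.txt` file, the one-output-directory-per-chapter layout, and the choice of the `small` model are all illustrative. The `BookNLP("en", model_params)` constructor and `process(input_file, output_directory, book_id)` call follow the usage shown in the BookNLP README.

```python
import os


def build_chapter_jobs(text_dir, output_dir):
    """Return (input_file, chapter_output_dir, book_id) for each .txt chapter."""
    jobs = []
    for filename in sorted(os.listdir(text_dir)):
        if not filename.endswith(".txt"):
            continue  # assumption: chapters are stored as .txt files
        book_id = os.path.splitext(filename)[0]
        jobs.append((
            os.path.join(text_dir, filename),
            os.path.join(output_dir, book_id),
            book_id,
        ))
    return jobs


def run_booknlp_on_chapters(text_dir, output_dir, model_size="small"):
    # Imported lazily so build_chapter_jobs works without BookNLP installed.
    from booknlp.booknlp import BookNLP

    model_params = {
        "pipeline": "entity,quote,supersense,event,coref",
        "model": model_size,
    }
    booknlp = BookNLP("en", model_params)
    for input_file, chapter_output_dir, book_id in build_chapter_jobs(text_dir, output_dir):
        os.makedirs(chapter_output_dir, exist_ok=True)
        booknlp.process(input_file, chapter_output_dir, book_id)
```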

#### 3. Searching and Matching Main Characters

BookNLP provides a list of characters in each of the chapters. The next task is to match them across the chapters for the entire text.

BookNLP tries to refer to each identified character by a proper noun where possible, falling back to a common noun and then to a pronoun. These nouns were matched across the chapters.

In [None]:
from get_main_char import get_main_char

characters_data_dir = os.path.join('characters', book)

get_main_char(ner_coref_data_dir, characters_data_dir)
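
The matching step can be pictured as summing per-chapter tallies into a single book-wide tally. The sketch below is an assumption about the general idea, not the code in `get_main_char`; the function name `merge_chapter_counts` and the input shape are hypothetical. The `{"count": ...}` output shape mirrors how `main_characters.json` is read later in this notebook.

```python
from collections import Counter


def merge_chapter_counts(chapter_counts):
    """Sum per-chapter {name: mention_count} dicts into one book-wide tally."""
    totals = Counter()
    for counts in chapter_counts:
        totals.update(counts)  # adds counts for names seen in earlier chapters
    # Highest counts first, in the {"count": ...} shape used further down.
    return {name: {"count": n} for name, n in totals.most_common()}


# Made-up placeholder names and counts, purely for illustration.
merged = merge_chapter_counts([
    {"Alice": 3, "Bob": 1},
    {"Alice": 2, "Carol": 4},
])
```

Because matching is by exact string, "Alice" and "Miss Alice" would remain separate entries at this stage.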

Now, a bit of manual work is needed. Some characters have aliases or titles, and some are referred to both with and without their surnames. The more common nouns were tracked and matched based on knowledge of the text (having read the book helps).

The most common characters are listed in the `main_characters.json` file in the `characters_data_dir` directory. Adjust `TOP_N_CHAR` to control how many of them are printed.

In [None]:
import os
import json

# Number of top characters to print
TOP_N_CHAR = 200

with open(os.path.join(characters_data_dir, "main_characters.json"), 'r') as file:
    main_characters_data = json.load(file)

# Sort characters by their "count" field, highest first
main_characters_data = dict(sorted(main_characters_data.items(), key=lambda x: x[1]["count"], reverse=True))

# Print the names of the first TOP_N_CHAR characters
for name in list(main_characters_data)[:TOP_N_CHAR]:
    print(name)

The `main_character_aliases.json` file must then be created manually and stored in `characters_data_dir`. An example of the file is as follows:

```json
[
    ["Name of Person 0", "Alias 1 of Person 0", "Alias 2 of Person 0"],
    ["Name of Person 1", "Alias 1 of Person 1"],
    ["Name of Person 2"]
]
```

Each character's list can hold any number of names and aliases, as long as it has at least one entry. Every string in the list must exactly match a noun in `main_characters.json`.
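
Since the match must be exact, it can help to check the alias file before consolidating. The following is a minimal sketch; the helper `find_unknown_aliases` is hypothetical and assumes the names are the top-level keys of `main_characters.json`, as in the cell that prints them.

```python
def find_unknown_aliases(alias_groups, main_characters):
    """Return alias strings that do not appear as keys in main_characters."""
    return [
        alias
        for group in alias_groups
        for alias in group
        if alias not in main_characters
    ]


# Example with in-memory data; in practice both arguments would come from
# json.load() on main_character_aliases.json and main_characters.json.
unknown = find_unknown_aliases(
    [["Name of Person 0", "Alias 1 of Person 0"], ["Name of Person 1"]],
    {"Name of Person 0": {"count": 10}, "Name of Person 1": {"count": 4}},
)
print(unknown)
```

Any string reported here is likely a typo or a noun that BookNLP never extracted.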

Now, the main characters can be matched across the chapters.

In [None]:
from consolidate_main_char import consolidate_main_char

consolidate_main_char(characters_data_dir)

Note: the `consolidated_id` in the newly generated `main_characters_consolidated.json` file is the index of the character's entry in the manually created `main_character_aliases.json` list. That character's names and aliases can be found at that index.
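
To make the index relationship concrete, here is a small sketch that maps every name or alias to its list index. The helper `build_alias_index` is hypothetical and is not part of `consolidate_main_char`.

```python
def build_alias_index(alias_groups):
    """Map every name or alias to the index of its group in the list."""
    alias_to_id = {}
    for consolidated_id, group in enumerate(alias_groups):
        for alias in group:
            alias_to_id[alias] = consolidated_id
    return alias_to_id


alias_groups = [
    ["Name of Person 0", "Alias 1 of Person 0", "Alias 2 of Person 0"],
    ["Name of Person 1", "Alias 1 of Person 1"],
    ["Name of Person 2"],
]
alias_index = build_alias_index(alias_groups)

# Going back from an id to the character's names is a plain list lookup.
names_for_person_1 = alias_groups[alias_index["Alias 1 of Person 1"]]
```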

#### 4. Fetching Relevant Sentences

The next step is to collate all sentences in the book that feature two or more characters, and link them to each character's `consolidated_id`.

A `relevant_sentences.csv` file is generated under the directory of each chapter in the output directory of BookNLP (`ner_coref_data_dir`).

In [None]:
from get_relevant_sentences import get_relevant_sentences_in_book

get_relevant_sentences_in_book(
    ner_coref_data_dir,
    text_dir,
    characters_data_dir
)
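
The selection rule itself can be sketched as below. This is an illustration, not the code in `get_relevant_sentences`; the function name and input shape are hypothetical, and it assumes "two or more characters" means two or more distinct `consolidated_id` values in the same sentence.

```python
def select_relevant_sentences(sentence_mentions):
    """Keep sentences that mention two or more distinct characters.

    sentence_mentions: iterable of (sentence_text, iterable_of_consolidated_ids)
    """
    relevant = []
    for sentence, ids in sentence_mentions:
        unique_ids = sorted(set(ids))  # repeated mentions count once
        if len(unique_ids) >= 2:
            relevant.append((sentence, unique_ids))
    return relevant


# Placeholder sentences and ids, purely for illustration.
rows = select_relevant_sentences([
    ("A met B.", [0, 1, 0]),
    ("A slept.", [0, 0]),
    ("Nobody was there.", []),
])
```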